How to build optimized packages

if you compile a package from CCR or AUR, it is possible to optimize it for better performance and size.

for safety you should work with a copy of /etc/makepkg.conf

$ sudo cp /etc/makepkg.conf /etc/makepkg-optimize.conf

you are not forced to use /etc or this name for the copy; any location and file name will do

use the copy with:

$ makepkg --config /etc/makepkg-optimize.conf
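Since makepkg.conf is a shell file that makepkg sources, one way to check what a config copy will hand to the compiler is to source it in a subshell and print the flag variables. This is only a sketch: show_build_flags is a made-up helper name, not part of makepkg, and it should be run from bash because a real makepkg.conf uses bash syntax.

```shell
# show_build_flags is a hypothetical helper, not part of makepkg.
# $1: path to a makepkg-style config file
# The file is sourced in a subshell, so the calling shell stays untouched.
show_build_flags() {
    (
        # start from empty values so only the config file decides the result
        CFLAGS="" CXXFLAGS="" CPPFLAGS="" LDFLAGS=""
        . "$1" || exit 1
        printf 'CFLAGS=%s\nCXXFLAGS=%s\nCPPFLAGS=%s\nLDFLAGS=%s\n' \
            "$CFLAGS" "$CXXFLAGS" "$CPPFLAGS" "$LDFLAGS"
    )
}
```

For example, show_build_flags /etc/makepkg-optimize.conf prints the four variables as the copy defines them, which makes a missing quote or a missing space easy to spot before a long build.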

next step: edit /etc/makepkg-optimize.conf to add the optimizations

1.a link time optimization (LTO) for GCC
1.1.a positive: a smaller and faster binary
1.2.a negative: needs more RAM and time for linking
1.3.a add this in your makepkg copy

CFLAGS+=" -flto=auto"
CXXFLAGS+=" -flto=auto"
LDFLAGS+=" -fuse-linker-plugin"

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#Link-Options
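A detail that is easy to miss when appending flags: in bash, += on a string variable is plain concatenation, so each appended flag needs its own leading space. A minimal sketch of the difference, using a stand-in starting value for CFLAGS (a real makepkg.conf sets its own):

```shell
#!/bin/bash
# Stand-in starting value; a real makepkg.conf defines its own CFLAGS.
CFLAGS="-O2 -pipe"

fused="$CFLAGS"
fused+="-flto=auto"      # no leading space: the last two flags run together
echo "$fused"            # -O2 -pipe-flto=auto

spaced="$CFLAGS"
spaced+=" -flto=auto"    # leading space keeps the flags separate
echo "$spaced"           # -O2 -pipe -flto=auto
```

The fused variant would be passed to the compiler as one unknown option, which typically makes the build fail right away.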

Link-time optimization is a type of program optimization performed by a compiler to a program at link time. Link time optimization is relevant in programming languages that compile programs on a file-by-file basis, and then link those files together (such as C and Fortran), rather than all at once (such as Java’s “Just in time” (JIT) compilation).

Once all files have been compiled separately into object files, traditionally, a compiler links (merges) the object files into a single file, the executable. However, in the case of the GCC compiler, for example, with Link Time Optimization (LTO) enabled, GCC is able to dump its internal representation (GIMPLE) to disk, so that all the different compilation units that will go to make up a single executable can be optimized as a single module. This expands the scope of inter-procedural optimizations to encompass the whole program (or, rather, everything that is visible at link time). With link-time optimization, the compiler can apply various forms of interprocedural optimization to the whole program, allowing for deeper analysis, more optimization, and ultimately better program performance. https://en.wikipedia.org/wiki/Link-time_optimization

1.b link time optimization (thinLTO) for LLVM (LLVM software suite 6 isn’t compiled with lto/thinLTO support)
1.1.b positive: a smaller and faster binary
1.2.b negative: needs more RAM and time for linking, but less than traditional (full) LTO
1.3.b add this in your makepkg copy

CFLAGS+=" -flto=thin"
CXXFLAGS+=" -flto=thin"
#use only one of the two LDFLAGS lines, depending on your linker
#LLVMgold
LDFLAGS+=" -Wl,-plugin-opt,cache-dir=/path/to/cache -Wl,-plugin-opt,jobs=$(getconf _NPROCESSORS_ONLN)"
#lld as default linker
LDFLAGS+=" -Wl,--thinlto-cache-dir=/path/to/cache,--thinlto-jobs=$(getconf _NPROCESSORS_ONLN)"

1.4 note: the llvm/clang 6.0.0-1 packages do not support thinLTO because they aren’t compiled with the necessary options
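The thinLTO cache and job options are linker options, so the clang ThinLTO documentation passes them through the compiler driver with a -Wl, prefix, and the spelling differs between gold and lld. The sketch below builds the matching LDFLAGS fragment for either linker. thinlto_ldflags is a hypothetical helper name, and the cache path and job count are whatever you pass in.

```shell
# thinlto_ldflags is a hypothetical helper for this sketch.
# $1: linker ("lld" or "gold"), $2: cache directory, $3: number of jobs
# Prints an LDFLAGS fragment with a leading space, ready to append.
thinlto_ldflags() {
    case "$1" in
        lld)
            printf ' -Wl,--thinlto-cache-dir=%s,--thinlto-jobs=%s\n' "$2" "$3"
            ;;
        gold)
            printf ' -Wl,-plugin-opt,cache-dir=%s -Wl,-plugin-opt,jobs=%s\n' "$2" "$3"
            ;;
        *)
            echo "unknown linker: $1" >&2
            return 1
            ;;
    esac
}
```

In the makepkg copy this could be used as LDFLAGS+="$(thinlto_ldflags lld /path/to/cache 4)", so that only the option set for the linker actually in use ends up in LDFLAGS.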

2.a graphite (GCC) (graphite is a framework for high-level memory optimizations using the polyhedral model.)
2.1.a https://www.cs.utexas.edu/~pingali/CS380C/2013/lectures/graphite.pdf
2.2.a add this in your makepkg copy

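CFLAGS+=" -fgraphite-identity -floop-nest-optimize -ftree-loop-distribution -ftree-vectorize"
CPPFLAGS+=" -fgraphite-identity -floop-nest-optimize -ftree-loop-distribution -ftree-vectorize"
CXXFLAGS+=" -fgraphite-identity -floop-nest-optimize -ftree-loop-distribution -ftree-vectorize"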

thinLTO documentation (for 1.b): https://clang.llvm.org/docs/ThinLTO.html#clang-llvmOptions

2.b Auto-Vectorization (clang/LLVM)
2.1.b https://www.cs.utexas.edu/~pingali/CS380C/2013/lectures/graphite.pdf
2.2.b add this in your makepkg copy

#only needed if you don't trust the default (clang normally vectorizes loops on its own when optimizing)
CFLAGS+=" -mllvm -vectorize-loops"
CXXFLAGS+=" -mllvm -vectorize-loops"

3. CPU optimization (GCC&LLVM)
3.1 positive: the compiler can use all instruction set extensions (and their registers) of the build machine's CPU
3.2 negative: the package works only on this CPU type!
3.3 add this in your makepkg copy

CFLAGS+=" -march=native"
CPPFLAGS+=" -march=native"
CXXFLAGS+=" -march=native"
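Because a -march=native package is tied to the build machine's CPU, it can help to check whether another machine offers the same extensions before copying the package over. A rough sketch for x86 Linux, assuming the flag names as they appear in /proc/cpuinfo (for example avx2). cpu_has_flag is a hypothetical helper; the optional second argument exists only so a saved cpuinfo file from another machine can be checked.

```shell
# cpu_has_flag is a hypothetical helper for this sketch (x86 Linux only).
# $1: flag name as listed in /proc/cpuinfo, for example avx2
# $2: optional cpuinfo file, defaults to /proc/cpuinfo
# Succeeds if the first "flags" line contains the flag as a whole word.
cpu_has_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | grep -qwF -- "$1"
}
```

For example, cpu_has_flag avx2 && echo "avx2 available" reports on the current machine, and cpu_has_flag avx2 other-cpuinfo.txt does the same for a saved copy.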

4.a profile guided optimization (pgo) (GCC)
4.1.a positive: should improve runtime performance
4.2.a negative: expensive, because it needs to compile the program twice, and you need a directory with write permission
4.3.a add this in your makepkg copy
4.3.1.a first compilation:

CFLAGS+=" -fprofile-generate -fprofile-dir=/<yourDirWithWPermisson>/$pkgbase.gen"
CXXFLAGS+=" -fprofile-generate -fprofile-dir=/<yourDirWithWPermisson>/$pkgbase.gen"
CPPFLAGS+=" -fprofile-generate -fprofile-dir=/<yourDirWithWPermisson>/$pkgbase.gen"
LDFLAGS+=" -lgcov --coverage"

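4.3.2.a second compilation (first run the instrumented program so that it writes its profile data, then rename the $pkgbase.gen directory to $pkgbase.used)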

CFLAGS+=" -fprofile-correction -fprofile-use -fprofile-dir=/<yourDirWithWPermisson>/$pkgbase.used"
CXXFLAGS+=" -fprofile-correction -fprofile-use -fprofile-dir=/<yourDirWithWPermisson>/$pkgbase.used"
CPPFLAGS+=" -fprofile-correction -fprofile-use -fprofile-dir=/<yourDirWithWPermisson>/$pkgbase.used"

https://en.wikipedia.org/wiki/Profile-guided_optimization
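The second compilation only pays off if the instrumented program was actually run and left profile data behind; GCC stores that data in *.gcda files. A small sketch of a check to run before starting the second build. have_profile_data is a hypothetical helper name, and the directory is whatever was given to -fprofile-dir.

```shell
# have_profile_data is a hypothetical helper for this sketch.
# $1: profile directory that was passed to -fprofile-dir
# Succeeds if at least one GCC profile data file (*.gcda) exists below it.
have_profile_data() {
    [ -n "$(find "$1" -type f -name '*.gcda' 2>/dev/null | head -n 1)" ]
}
```

Something like have_profile_data /path/to/profiles || echo "no profile data yet, run the program first" avoids spending a full rebuild on an empty profile.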

will be edited soon:
4.b profiling with instrumentation (llvm/clang)
4.1.b positive: should improve runtime performance
4.2.b negative: expensive, because it needs to compile the program twice, and you need a directory with write permission
4.3.b add this in your makepkg copy
4.3.1.b first compilation:

CFLAGS+=" -fprofile-instr-generate=/<yourDirWithWPermisson>/$pkgbase.gen"
CXXFLAGS+=" -fprofile-instr-generate=/<yourDirWithWPermisson>/$pkgbase.gen"
CPPFLAGS+=" -fprofile-instr-generate=/<yourDirWithWPermisson>/$pkgbase.gen"

4.3.2.b second compilation

#merge the raw profile first:
#llvm-profdata merge -output=/<yourDirWithWPermisson>/$pkgbase.used /<yourDirWithWPermisson>/$pkgbase.gen
CFLAGS+=" -fprofile-instr-use=/<yourDirWithWPermisson>/$pkgbase.used"
CXXFLAGS+=" -fprofile-instr-use=/<yourDirWithWPermisson>/$pkgbase.used"
CPPFLAGS+=" -fprofile-instr-use=/<yourDirWithWPermisson>/$pkgbase.used"

This ($(getconf _NPROCESSORS_ONLN)) returns the exact number of cpu cores in your system, right? If so, then it’s a problem. On a 12 core Ryzen it will tell make to use 12 processes to compile the code. It sounds good, but that will render your system mostly unresponsive because of the extreme cpu load.
The right way to handle this (at least in high core count systems) is to set make’s process count to one less than the core count. Unfortunately i was not able to handle this in

CFLAGS+=" -flto=$(getconf _NPROCESSORS_ONLN)"

here, because this line did not handle arithmetic operators like _NPROCESSORS_ONLN - 1 etc. AFAIK you can only do it by hand.
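Since makepkg.conf is sourced by the shell, arithmetic expansion with $(( ... )) may be a way around doing this by hand. A minimal sketch, assuming getconf is available; build_jobs is a made-up helper name, and the lower bound of 1 is an added safeguard for single-core machines.

```shell
# build_jobs is a hypothetical helper for this sketch.
# $1: optional core count; defaults to the number of online processors
# Prints one less than the core count, but never less than 1.
build_jobs() {
    cores="${1:-$(getconf _NPROCESSORS_ONLN)}"
    n=$((cores - 1))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Possible use in the makepkg copy (commented out here):
# CFLAGS+=" -flto=$(build_jobs)"
# or without a helper:
# CFLAGS+=" -flto=$(( $(getconf _NPROCESSORS_ONLN) - 1 ))"
```

On a 12 core machine build_jobs prints 11, which leaves one core free for the desktop.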


really? i couldn’t find much difference on my two notebooks with an old core2 and an i5-3320M … perhaps it’s related to this issue: AMD CPU frequency scheduler issues since kernel 4.19
anyway, in such a case i would use this little program to adjust the nice level

Hmm, this needs some investigation.

When i built something with -j12, all of the cores were at 100% utilization and even the mouse pointer lagged.