Do cyborgs dream of bionic sheep? - Packaging in different systems

This is part of a series where I write down things I know or I've learned while trying to keep myself sharp in various subjects that I don't use so often anymore.

This post is quite sparsely structured. However, since there are so few posts about packaging, and specifically about how to handle weird conda-forge things, I thought it might be useful for someone anyway!

Operating System and CPU architecture

A given package will only work on the operating system and CPU architecture it was built for. For example, packages compiled for macOS on Intel x86_64 won't run natively on the ARM64 architecture of Apple's M1.
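
As a rough illustration, the check below compares the architecture a package was built for against the machine running it. The function names and the alias table are assumptions made for this sketch, not part of any packaging tool.

```python
import platform

# Different tools spell the same CPU architecture differently.
# Assumed, non-exhaustive alias table for this sketch.
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def normalize_arch(name):
    """Map an architecture name to a canonical spelling."""
    return ARCH_ALIASES.get(name.lower(), name.lower())


def runs_natively(package_arch, machine_arch=None):
    """Return True if a package built for package_arch matches the machine."""
    if machine_arch is None:
        machine_arch = platform.machine()  # architecture of the current machine
    return normalize_arch(package_arch) == normalize_arch(machine_arch)


# Show what Python reports for the current operating system and architecture.
print(platform.system(), platform.machine())
```

Calling runs_natively("x86_64", "arm64") returns False, which is the Intel versus M1 mismatch described above.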

Solving linking problems
- Dynamically linked libraries contain functions, classes, variables, and resources that can be used by multiple programs simultaneously. This promotes code reuse and efficient memory usage. They enable 'dynamic linking' where the linking happens at run time when both executable files and libraries are loaded into memory.
  - The file extension is .dylib on macOS, .so on Linux and .dll on Windows.
- ldd and nm are used to inspect binary files. On macOS, otool takes the place of ldd.
  - The ldd command prints the shared libraries required by each program or shared library passed to it.
```
ldd /path/to/program
```
  - nm is used to inspect the symbols within object files or binaries and is available as part of the GNU binutils package.
```
nm /path/to/object_file
```
    - Find Symbol Definitions: You can use nm to find out whether a symbol is defined or used in an object file, and whether the symbol is exported (global) or static.
    - Debugging Linking Issues: nm is useful for debugging linking problems by showing which symbols are undefined or multiply defined in an object file.
    - Optimizing Builds: By examining the symbols, developers can get insights into how to optimize their build (for example, by removing unused static functions)
  - On macOS, use otool -L to achieve similar functionality to ldd. This command lists the dynamic library dependencies of a binary.
- In the context of compiled languages like C and C++, symbols are the names of entities such as functions, classes and variables once the code has been compiled into object files (.o), static libraries (.a), shared libraries (.so, .dylib, etc.), or executables.
  - Undefined Symbols: These are symbols that are referenced in an object file but defined in a different object file or library. If the linker cannot find the definition anywhere, linking fails, which makes them one of the main causes of linking errors.
  Object Files (.o/.obj)
  - Creation: Generated by the compiler when you compile a C/C++ source file without linking it.
  - Purpose: They contain the compiled code of individual source files in a form that can be handled by the linker.
  - Usage: Object files are the building blocks that are put together to create either a static library, shared library, or an executable.
  Static Libraries (.a/.lib)
  - Creation: Created by the archiver tool (like ar in Unix) from one or more object files.
  - Purpose: They serve as a single package of multiple object files, allowing for easier distribution and re-use of library code.
  - Usage: When you link an application with a static library, the contents of the library are copied into the final executable, resulting in no runtime dependency on the library.
  Shared Libraries (.so/.dylib/.dll)
  - Creation: Produced by the linker when building a library that is meant to be shared between multiple executables.
  - Purpose: Contain code and data that can be used by multiple executables simultaneously, promoting code reuse and reducing the memory footprint of applications.
  - Usage: At runtime, the OS loads the shared library into memory, and different programs can use its code and data without having to have their own copy. This requires that the shared library is available during the runtime on the system where the executable is run.
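
To make the nm workflow above a bit more concrete, here is a minimal sketch that picks the undefined symbols out of nm output. The sample text is hand-written for illustration (it is not the output of a real binary), and the function name is made up for this post.

```python
def undefined_symbols(nm_output):
    """Return the names of symbols that nm marks as undefined (type 'U')."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined symbols have no address, so the line is just: U name
        if len(parts) == 2 and parts[0] == "U":
            names.append(parts[1])
    return names


# Hand-written sample in the usual "address type name" layout of nm.
sample = """\
0000000000000000 T main
                 U printf
                 U my_missing_function
0000000000000000 D counter
"""

print(undefined_symbols(sample))
```

On a real file you could feed it the text captured from running nm through subprocess.run with capture_output=True and text=True.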

Questions and answers:

How does the operating system use shared libraries at runtime?
- The executable records a list of required libraries. When the program starts, the dynamic linker resolves their paths and loads them into memory. If multiple apps use the same library, they share the in-memory copy of its code.
What must be ensured for a shared library to be used by an executable at runtime?
- The library must be available on the system where the executable is run, in a directory known to the dynamic linker (e.g., a standard system directory, or one listed in LD_LIBRARY_PATH on Linux).
- The version of the shared library must be compatible with what the executable was built against.
- Proper permissions must be set on the shared library so that the user running the executable has the rights to read and execute it.
Explain the implication of linking an application with a static library for the final executable. How is this achieved?
- When an application is linked with a static library, the contents of the static library are incorporated directly into the final executable during the linking phase. This increases the size of the executable since it now contains the library code within it. The implication is that the final executable is larger but self-contained, meaning it doesn't rely on external libraries at runtime. This is achieved using a linker tool that combines object files and static libraries into a single executable binary.
How are object files used in creating an executable?
- Object files are the result of compiling source code files. During the linking phase, a linker tool takes these object files and combines them, resolving any references or symbols between them (such as function calls or global variables). The linker also includes code from libraries that the application depends on. The result of this process is a single executable file that can be run on the operating system.
What are the benefits of using dynamically linked libraries over statically linked libraries?
- Reduced Memory Usage: Since shared libraries can be loaded into memory once and shared between multiple running applications, they can reduce the overall memory footprint.
- Smaller Executables: Executables that use dynamic libraries are smaller because they don't contain the library code—they only contain references to the shared libraries.
- Easier Updates: Shared libraries can be updated independently of the executables that use them, allowing for bug fixes and updates without having to recompile the executables.
- Modularity: Dynamic linking promotes modularity, as developers can split their applications into separate components that are easier to manage and update.
- Resource Sharing: Resources like memory and disk space are used more efficiently because the shared code is not duplicated across multiple executables.
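
The runtime loading described in these answers can be tried from Python with ctypes, which asks the dynamic linker to load a shared library into the running process. This is only a sketch: it assumes the C math library can be found under the name "m", which is typical on Linux and macOS but not on Windows, so the code falls back to None when the lookup fails.

```python
import ctypes
import ctypes.util


def load_math_library():
    """Find and load the C math library, or return None if it is not found."""
    path = ctypes.util.find_library("m")  # the result is platform specific
    if path is None:
        return None
    return ctypes.CDLL(path)  # the dynamic linker loads the library here


libm = load_math_library()
if libm is not None:
    # Declare the C signature: double cos(double)
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))
```

If the library is missing or unreadable, the failure happens here at runtime rather than at build time, which is exactly the trade-off of dynamic linking.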

How to create a pip package with a fair load of C++ code in it?

First create the package structure. Use a tool that will allow you to create a Python-callable interface to the C++ code, like Cython or pybind11.
Then create a setup.py file, which in this case uses setuptools:

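from setuptools import setup, Extension
from setuptools.command.build_ext import build_ext


class get_pybind_include:
    """Defer importing pybind11 until the include path is actually needed,
    so that setup.py can be loaded before pybind11 is installed."""

    def __str__(self):
        import pybind11
        return pybind11.get_include()


ext_modules = [
    Extension(
        'my_module.my_cpp_extension',  # Name of the module
        ['my_module/my_cpp_extension.cpp'],  # C++ source files
        include_dirs=[
            get_pybind_include(),  # Path to pybind11 headers
            '/usr/include/eigen3',  # Path to other headers needed
        ],
        language='c++'
    ),
]

setup(
    name='my_package',
    version='0.0.1',
    author='Your Name',
    author_email='your.email@example.com',
    url='https://github.com/yourusername/my_package',
    description='A minimal package with a C++ extension',
    long_description='',
    ext_modules=ext_modules,
    install_requires=['pybind11>=2.5.0'],  # Depend on pybind11
    setup_requires=['pybind11>=2.5.0'],  # Needed at build time for the headers
    cmdclass={'build_ext': build_ext},  # Use the setuptools build_ext command
    zip_safe=False,
)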

To build the package, first install the build tools with pip install setuptools wheel, then compile the extension with python setup.py build_ext --inplace. This will generate shared objects. The kinds of objects that will be generated are platform specific.
To distribute the package, use python setup.py sdist bdist_wheel to create an sdist and a wheel, and twine to upload them.

What is a wheel? And what’s an egg?
- A wheel is a built package that can be installed without a separate build step.
- Eggs are older. They’re deprecated and I don’t know how to use them.
- There are also sdists: archives that contain the source code of the Python package.
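
Because a wheel is already built, its filename says which Python, ABI and platform it was built for. The sketch below splits a wheel filename into those tags, following the name-version-python-abi-platform layout of the wheel format. The function name and the example filenames are made up for illustration.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its tagged components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


# Hypothetical wheel for a package with a compiled extension.
compiled = parse_wheel_filename("my_package-0.0.1-cp311-cp311-manylinux_2_17_x86_64.whl")
print(compiled["platform"])

# Hypothetical pure-Python wheel, which works on any platform.
pure = parse_wheel_filename("my_package-0.0.1-py3-none-any.whl")
print(pure["platform"])
```

A package with C++ code in it ends up with platform-specific wheels like the first one, which is why you usually need one wheel per operating system and architecture.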

Describing each option from a meta.yaml file:

What is the host section?

Host: The "host" environment contains the dependencies that your package links against during the build, such as libraries and their headers, built for the platform the package will run on. This matters most when cross-compiling: tools that run on the build machine (compilers, make, cmake) belong in the build section, while the libraries for the target platform belong in host. Host dependencies are not automatically installed for end users; anything needed at runtime belongs in the run section, although some host packages add a matching run dependency through run_exports.

This is where the build scripts will look for dependencies to link against or execute during the build process. If you're not cross-compiling, the "host" and "build" platforms are the same.

Use selectors in meta.yaml: You can use selectors like # [win] or # [linux] to include dependencies or run scripts conditionally depending on the platform. Here's a simple example:

requirements:
  host:
    - toolchain  # [linux]
    - m2-make  # [win]
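
To show the idea behind selectors, here is a simplified sketch that keeps or drops recipe lines based on a trailing # [expression] comment. It only imitates the behaviour described above and is not conda-build's actual implementation; the function name and the flags dictionary are assumptions for this example.

```python
import re

# Matches "some content  # [expression]" at the end of a line.
SELECTOR = re.compile(r"^(?P<content>.*?)\s*#\s*\[(?P<expr>[^\]]+)\]\s*$")


def apply_selectors(lines, flags):
    """Keep lines without a selector, and lines whose selector is true."""
    kept = []
    for line in lines:
        match = SELECTOR.match(line)
        if match is None:
            kept.append(line)  # no selector: always keep the line
        elif eval(match.group("expr"), {"__builtins__": {}}, dict(flags)):
            kept.append(match.group("content"))  # selector is true: keep it
    return kept


recipe = [
    "requirements:",
    "  host:",
    "    - toolchain  # [linux]",
    "    - m2-make  # [win]",
]

# Pretend we are rendering the recipe on Linux.
for line in apply_selectors(recipe, {"linux": True, "win": False}):
    print(line)
```

Since the expression is evaluated as Python here, something like # [not win] also works in this sketch.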

Can I have pip packages, or conda packages from other channels, in my recipe?

As a general rule: all dependencies have to be packaged by conda-forge as well. This is necessary to assure ABI compatibility for all our packages.

Some dependencies are so close to the system that they are not packaged with conda-forge. These dependencies have to be satisfied with Core Dependency Tree (CDT) packages.

A CDT package consists of repackaged CentOS binaries from the appropriate version, either 6 or 7 depending on user choice and platform. We manage the build of CDT packages using a centralized repo, conda-forge/cdt-builds, as opposed to generating feedstocks for them. (Note that historically we did use feedstocks but this practice has been deprecated.) To add a new CDT, make a PR on the conda-forge/cdt-builds repo.

What the hell is this cross-compilation stuff about anyway?

Cross-compilation is a technique where you compile a program on one type of system (the build machine) to run on another type (the target, which is what conda calls the host platform). Cross-compilation is most commonly used when the target platform is not easily accessible, or when building is much faster or more convenient on the build machine. For example:

Embedded systems often use processors with different architectures than those in the developers' own computers. For instance, a developer might be using an x86_64 architecture computer but needs to build an application for an ARM architecture-based embedded system, like a Raspberry Pi or a smartphone.

Here's how it works in this scenario:

Development Machine: The developer has a powerful x86_64 machine running Linux, macOS, or Windows.
Target Platform: The application needs to run on an ARM-based device, which has less computing power and might not be suitable for efficient development and compilation.

The developer will use a cross-compiler on their x86_64 machine that generates executable code for the ARM architecture. This allows the developer to take advantage of the more powerful development machine, speeding up the compile-test-debug cycle. Once the application is ready, the executable can be transferred to the ARM-based device for execution.

Cross-compilation is often faster than native compilation in such cases because the development machine has more resources (CPU, memory, etc.) than the target device. It also avoids the need to set up a full development environment on the target device, which might be constrained in terms of resources or might not be designed for such tasks.
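
In conda terms, cross-compiling simply means that the platform you build on differs from the platform you build for. The sketch below expresses that comparison with conda-style platform names. The lookup table is a non-exhaustive assumption and the function names are made up for this example.

```python
import platform

# Assumed mapping from (system, machine) to conda-style platform names.
SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
    ("Windows", "AMD64"): "win-64",
}


def current_platform():
    """Return the conda-style name of this machine, or None if unknown."""
    return SUBDIRS.get((platform.system(), platform.machine()))


def is_cross_compiling(build_platform, target_platform):
    """A build is a cross-compilation when the two platforms differ."""
    return build_platform != target_platform


# The scenario from the text: an x86_64 Linux machine building for ARM.
print(is_cross_compiling("linux-64", "linux-aarch64"))
```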
