7

I'm trying to learn to use cython to speed up some code based on pysam. My issue is not strictly speaking about bioinformatics, but rather about building tools using a bioinformatics library. I hope this is still relevant in this site (I hesitated posting it as a github issue to pysam, alternatively, but I was not sure that this was a problem with pysam).

In my project's folder, the bam25prime/libbam25prime.pyx file has the following content:

# cython: language_level=3
from pysam.libchtslib cimport BAM_FREVERSE
from pysam.libcalignedsegment cimport AlignedSegment

cdef object ccollapse_ali(AlignedSegment ali):
    """
    Return the bed information corresponding to the 5' collapse
    of AlignedSegment *ali*.
    """
    if (ali.flag & BAM_FREVERSE) != 0:
        return "\t".join([
            ali.reference_name,
            # five prime position
            str(ali.reference_end - 1),
            str(ali.reference_end),
            ".",
            ".",
            "-"])
    else:
        return "\t".join([
            ali.reference_name,
            # five prime position
            str(ali.reference_start),
            str(ali.reference_start + 1),
            ".",
            ".",
            "+"])


def collapse_ali(ali):
    """
    Return the bed information corresponding to the 5' collapse
    of AlignedSegment *ali*.
    """
    return ccollapse_ali(ali)

In my setup.py, the setup function is called with this argument: ext_modules = cythonize("bam25prime/libbam25prime.pyx").

When I try to do the build (running python3.6 setup.py build_ext), cython seems to miss some include paths, because I get the following error:

[1/1] Cythonizing bam25prime/libbam25prime.pyx
running build_ext
building 'bam25prime.libbam25prime' extension
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -O3 -march=native -fomit-frame-pointer -fPIC -I/home/bli/.local/lib/python3.6/site-packages/pysam -I${HOME}/include -I/home/bli/include/python3.6m -I/home/bli/.local/include -c bam25prime/libbam25prime.c -o build/temp.linux-x86_64-3.6/bam25prime/libbam25prime.o
bam25prime/libbam25prime.c:497:28: fatal error: htslib/kstring.h: No such file or directory

If I manually re-run the gcc command, adding -I/home/bli/.local/lib/python3.6/site-packages/pysam/include/htslib, the header file gets found.

How do I get cython to correctly determine the path to the necessary header files?

I guess another way to formulate the question would be the following: How is cython supposed to determine what include flags to use?

Or maybe there is a problem with my pysam installation?

If I remember well, I installed it using pip3.6 --user install pysam. I have a ~/.pydistutils.cfg file containing the following settings

[install]
optimize=1
[build_ext]
include_dirs=${HOME}/include
library_dirs=${HOME}/lib
rpath=${HOME}/lib
user=1

Any advice appreciated.

bli
  • 3,130
  • 2
  • 15
  • 36
  • Have you tried setting C_INCLUDE_PATH? That's usually sufficient. – Devon Ryan Aug 02 '18 at 17:57
  • Make sure htslib is in your include path – GWW Aug 02 '18 at 19:17
  • 2
    Shouldn't cython infer this automatically when a python library is installed? Or is it because htslib is not in itself a python library, although distributed together with pysam? – bli Aug 03 '18 at 07:59

1 Answers1

6

Based on a suggestion by Devon Ryan, I searched how to set C_INCLUDE_PATH for cython, and found this issue. It refers to a (currently not working) feature present in cython's documentation, that should let libraries inform cython of the correct localisation for the header files in the module's setup:

ext_modules = cythonize(
    "bam25prime/libbam25prime.pyx",
    include_path=pysam_get_include())

As noted in the issue discussion, this seems to have no effect, but a workaround exists, consisting in using the Extension class to store the header localization information.

So the general answer is: Some python libraries (including pysam) provide a function that returns the paths to their include directories. This can be used to inform cython, but has to be done "manually".

My setup.py now looks as follows:

from setuptools import setup, find_packages
# If you have .pyx things to cythonize
from distutils.extension import Extension
from Cython.Build import cythonize
from pysam import get_include as pysam_get_include

name = "bam25prime"
__version__ = "0.1"

# https://github.com/cython/cython/blob/master/docs/src/reference/compilation.rst#configuring-the-c-build
extensions = [
    Extension(
        "bam25prime.libbam25prime", ["bam25prime/libbam25prime.pyx"],
        include_dirs=pysam_get_include()),
    ]

setup(
    name=name,
    version=__version__,
    # [description, author, email, licence, ...]
    packages=find_packages(),
    scripts=["bin/%s" % name],
    ext_modules = cythonize(extensions),
    # ext_modules = cythonize(
    #     "bam25prime/libbam25prime.pyx",
    #     include_path=pysam_get_include()),
    install_requires=["pysam", "pybedtools"])

Then, python3.6 setup.py build_ext succeeds.

And for whoever might be interested, here are the other files of the project:

bin/bam25prime is a link to bam25prime/bam25prime.py, which looks like this:

#!/usr/bin/env python3
# vim: set fileencoding=<utf-8> :
"""
When executed, this module collapses reads in a bam file to their 5-prime end
(reducing them to length 1) and writes the result in bed format.
"""

import argparse
from pybedtools import BedTool, Interval
from pysam import AlignmentFile
from .libbam25prime import collapse_ali

# pybedtools.Interval | pysam.AlignedSegment
# chrom               | reference_name
# start               | reference_start
# end                 | reference_end
# bool(int(strand))   | is_reverse
#
# Interval(reference_name, reference_start, reference_end, strand=strand)    


def collapse_and_sort(alis):
    """
    Collapse elements from *ali* to their 5 prime end.

    The resulting :class:`pybedtool.BedTool` is sorted.

    :param alis: An iterable of :class:`pysam.AlignedSegment`
    :rtype: :class:`pybedtool.BedTool`

    :class:`pybedtool.BedTool` is an iterable:
    It can be iterated over in a for loop, or produce an iterator
    using :func:`iter` on it
    (see https://stackoverflow.com/a/28353158/1878788).
    """
    return BedTool(
        "\n".join(map(collapse_ali, alis)),
        from_string=True).sort(stream=True)


def main():
    """Main function of the program."""
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        "-b", "--bamfile",
        required=True,
        help="Sorted and indexed bam file containing the aligned reads.")
    parser.add_argument(
        "-o", "--out_bed",
        required=True,
        help="Bed file in which the results should be written.")
    args = parser.parse_args()
    with AlignmentFile(args.bamfile, "rb") as bamfile:
        collapse_and_sort(bamfile.fetch()).saveas(args.out_bed)
    return 0


if __name__ == "__main__":
    exit(main())

bam25prime/libbam25prime.pyx:

#!/usr/bin/env python3
# vim: set fileencoding=<utf-8> :
# cython: language_level=3
"""
This library contains a cythonized version of a function to collapse aligned
segments to their 5' end.
"""

from pysam.libchtslib cimport BAM_FREVERSE
from pysam.libcalignedsegment cimport AlignedSegment


cdef object ccollapse_ali(AlignedSegment ali):
    """
    Return the bed information corresponding to the 5' collapse
    of AlignedSegment *ali*.
    """
    if (ali.flag & BAM_FREVERSE) != 0:
        return "\t".join([
            ali.reference_name,
            # five prime position
            str(ali.reference_end - 1),
            str(ali.reference_end),
            ".",
            ".",
            "-"])
    else:
        return "\t".join([
            ali.reference_name,
            # five prime position
            str(ali.reference_start),
            str(ali.reference_start + 1),
            ".",
            ".",
            "+"])


def collapse_ali(ali):
    """
    Return the bed information corresponding to the 5' collapse
    of AlignedSegment *ali*.
    """
    return ccollapse_ali(ali)


def main():
    """Example use case."""
    print("No examples to show here.")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

(Any advice or suggestions welcome: I'd like to improve my python package-making skills.)


For those interested, to use cython with pybedtools I had to define the following Extension (where glob is glob.glob and /home/bli/src/bedtools2-2.27.1/ is the location of the source code of bedtools).

    Extension(
        "bam25prime.libcollapsebed", ["bam25prime/libcollapsebed.pyx"],
        include_dirs=glob("/home/bli/src/bedtools2-2.27.1/src/utils/*"),
        language="c++"),
bli
  • 3,130
  • 2
  • 15
  • 36