Detecting all pages which contain color

Question

In a larger LaTeX document there are often only some pages with color content (mainly figures), while the remaining ones are only black and white. Because printing costs for color pages are much higher than for black and white, it would be good to be able to extract all pages with color and print them separately. The first step for this is to be able to detect whether a page contains color or not. The result could take the form of a text list of page numbers suitable to be read by a PDF page extraction script (using e.g. pdftk).

A simple solution sufficient for many people would be to detect all pages which contain a figure and assume that only these have color. However, a general solution would be nice. Only color elements which are printed should be taken into account, while e.g. the color frames drawn around links by hyperref should not. It is OK if the solution disables these for the detection.

The idea I have so far is to use zref page labels and hook them to all \color macros. However, AFAIK these will interfere with vertical mode and might influence the normal typesetting. I will try to code an answer myself but looking forward to see other approaches. — Martin Scharrer, Apr 26 '12 at 20:27
You have probably already found this, anyway: How do I know if PDF pages are color or black-and-white? — gcedo, Apr 26 '12 at 20:32
@Edo: No, I didn't look on SO. Thanks for the link, it's very useful. I'm looking for a more LaTeX-based solution, but these are also to be considered. — Martin Scharrer, Apr 26 '12 at 20:36
I would still say it's easiest to analyse the PDF, as suggested in the referred question. I don't know whether there are free tools to color-separate a PDF, but running pdftoppm and then ImageMagick to check the colors should be easy to do. Trying to hook into \color you'll face enormous problems. To name two: (a) you need to identify colors which are really gray. (b) What happens with pages onto which colored text has been broken by a page break? They won't contain a color change (to black at most). And then you haven't even covered images. — Stephan Lehmke, Apr 26 '12 at 21:00
@StephanLehmke: Good points. I like ImageMagick most, so I had a look. Its identify tool seems to be suitable. It converts PDF pages to raster images and seems to select the required color space itself (e.g. Gray or RGB). My first try: for N in `seq 1 $PAGES`; do echo -n "$N: "; identify -format "%[colorspace]" $FILE.pdf[$((N-1))]; done. It prints either Gray or RGB for each page, depending on whether there are colors on it. It ignores hyperref color borders and possibly other PDF annotations. — Martin Scharrer, Apr 26 '12 at 21:29
@StephanLehmke: I have now written a small script for this and posted it on the original SO question: How do I know if PDF pages are color or black-and-white? — Martin Scharrer, Apr 26 '12 at 22:13
Could you self-answer this question? I keep getting it in "Unanswered" :-) — Stephan Lehmke, Apr 29 '12 at 05:01
"Because printing costs for color pages are much higher". While that is indeed true, this does not imply "printing on a color printer is more expensive": all half-recent professional color laser printers I am aware of bill color/bw pages page-wise, that is, if you print a 100-page job with 10 color pages, you pay for 90 bw pages and 10 color pages. So I suggest checking this with your print provider. It might save you from some manual and error-prone sorting of pages :-) — Daniel, May 16 '12 at 13:44

score 71 · Answer 1 · edited Dec 09 '20 at 16:24

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100%.

Example commandline:

gs -o - -sDEVICE=inkcov /path/to/your.pdf

Example output:

Page 1
0.00000  0.00000  0.00000  0.02230 CMYK OK
Page 2
0.02360  0.02360  0.02360  0.02360 CMYK OK
Page 3
0.02525  0.02525  0.02525  0.00000 CMYK OK
Page 4
0.00000  0.00000  0.00000  0.01982 CMYK OK

You can see here that pages 1 and 4 use no color, while pages 2 and 3 do. This case is particularly 'nasty' for people who want to save on color ink: because the C, M and Y values are exactly the same on each of these two pages (and on page 2 the K value as well), the pages could appear to the human eye not as color pages but as ("rich") grayscale anyway (if each single pixel is mixed from these color values).
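The reading of the inkcov output described above can be automated. The following is a minimal Python sketch, not part of the original answer: the function names and the label "maybe-gray" are made up for illustration, and it assumes that every result line has the four numbers followed by "CMYK OK" as in the example output. Pages are numbered by counting result lines, so it does not depend on the "Page N" header lines being present.

```python
import re
import subprocess

# One inkcov result line, e.g. "0.02360  0.02360  0.02360  0.02360 CMYK OK"
CMYK_LINE = re.compile(
    r"^\s*(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+CMYK OK"
)


def parse_inkcov(text):
    """Return one (c, m, y, k) tuple per page, in page order."""
    pages = []
    for line in text.splitlines():
        match = CMYK_LINE.match(line)
        if match:
            pages.append(tuple(float(value) for value in match.groups()))
    return pages


def classify(cmyk):
    """Label a page 'bw', 'maybe-gray' (C == M == Y, non-zero) or 'color'."""
    c, m, y, _k = cmyk
    if c == 0 and m == 0 and y == 0:
        return "bw"
    if c == m == y:
        # Could be rich gray or equally sized color patches: check by eye.
        return "maybe-gray"
    return "color"


def run_inkcov(pdf_path):
    """Run the command from the answer and return its output (needs gs on PATH)."""
    result = subprocess.run(
        ["gs", "-q", "-o", "-", "-sDEVICE=inkcov", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Usage would be along the lines of `[classify(p) for p in parse_inkcov(run_inkcov("your.pdf"))]`. Pages labelled "maybe-gray" are exactly the ambiguous C = M = Y case and still need a manual decision.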

Ghostscript can also convert color into grayscale. Example commandline:

gs                                \
  -o grayscale.pdf                \
  -sDEVICE=pdfwrite               \
  -sColorConversionStrategy=Gray  \
  -dProcessColorModel=/DeviceGray \
   /path/to/your.pdf

Checking for the ink coverage distribution again (note how the addition of -q to the parameters slightly changes the output format):

gs -q  -o - -sDEVICE=inkcov grayscale.pdf
 0.00000  0.00000  0.00000  0.02230 CMYK OK
 0.00000  0.00000  0.00000  0.02360 CMYK OK
 0.00000  0.00000  0.00000  0.02525 CMYK OK
 0.00000  0.00000  0.00000  0.01982 CMYK OK

answered Jun 26 '12 at 05:39 · edited Dec 09 '20 at 16:24 · Kurt Pfeifle

Why did you say that in the first example page 1 uses color? Was it a mistake? – giordano Sep 30 '13 at 17:10
Here's a quick command for folks who want to count the number of non-color pages in a PDF:
gs -o - -sDEVICE=inkcov path/to/pdf.pdf | grep -e "0.00000 0.00000 0.00000 [01]." | wc

And +1 to giordano.
– igordcard Dec 10 '14 at 15:19
@giordano: Thanks for the hint in your comment -- didn't see it before. Of course you are right. I can't tell you anymore what happened. Maybe I pasted the wrong PDF file's output into the editing window. I'll correct the answer accordingly... – Kurt Pfeifle Feb 06 '15 at 15:55
I only returned to this answer because SE notified me that it today has earned me the 'Populist' badge :-) – Kurt Pfeifle Feb 06 '15 at 16:02
Page 2 may be as well in gray-scale, so this doesn't mean the page contains color. So if C=M=Y then you know that the page is not black. – Piotr Oct 31 '16 at 14:23
@Piotr But Page 2 would be best off printed as grayscale, rather than full colour, as the latter would use up colour ink but give the same result? – owjburnham Oct 31 '17 at 14:07
@Piotr: your conclusions are premature! Page 2 could consist of 4 pure color patches of equal size (C, M, Y and K). It could also be a single gray-looking area where each individual pixel is composed of identical values for C, M, Y and K. Or anything in between... The only thing we know: it uses color to create its appearance! – Kurt Pfeifle Oct 31 '17 at 16:44
@igordcard: Your command didn't work for me, but I could fix it by replacing each space with two spaces. I also added the -l parameter to wc so that it doesn't display superfluous information, resulting in grep -e "0.00000  0.00000  0.00000  [01]." /tmp/t | wc -l. – Konrad Höffner Mar 23 '20 at 13:44

score 17 · Accepted Answer · edited May 23 '17 at 12:39

For the general case it indeed seems better to use an external tool to test for all pages which contain colors. This is the topic of the mentioned SO question How do I know if PDF pages are color or black-and-white?. I have now written an answer to it which includes a small script for this.

However, it is much easier to get a list of all pages containing figures. Here I use the zref-abspage package to get an absolute page counter. The normal \write command can be used which will expand its content when the surrounding content is really placed on a page. Therefore the page counters will have the correct value. Then the end-macro of figure can simply be patched to hold this code.

\documentclass{book}
\usepackage{mwe}

\usepackage{zref-abspage}% absolute page counter
\newwrite\figpages
\openout\figpages=\jobname.fpg
\makeatletter
\g@addto@macro\endfigure{%
    % Write absolute page number and page label to file
    % Do not use \immediate!
    \write\figpages{\number\value{abspage}: \thepage}%
}
\makeatother

\newcount\mycount% for example loop
\begin{document}
\frontmatter
\Blindtext

\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}

\mainmatter
\Blindtext

\loop% keep MWE small by using a loop

\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}

{\Blindtext}

\advance\mycount by 1
    \ifnum\mycount<20\relax
\repeat

\backmatter
\appendix
\Blindtext

\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}

\end{document}

This generates a .fpg file (for figure pages) in which each line has the form <absolute page number>: <page label>, as produced by the \write call above.

The format can be changed if required.
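To feed such a list to a page extraction tool, the absolute page numbers can be collapsed into ranges. Below is a small Python sketch of that step, offered as one possible approach rather than part of the original answer. It assumes lines of the form "abspage: label" as written by the \write call above; the function names are hypothetical.

```python
def read_figure_pages(lines):
    """Collect absolute page numbers from lines like '<abspage>: <label>'."""
    pages = set()
    for line in lines:
        number, sep, _label = line.partition(":")
        # Skip anything that does not start with an integer and a colon.
        if sep and number.strip().isdigit():
            pages.add(int(number))
    return sorted(pages)


def to_ranges(pages):
    """Compress a sorted list of page numbers into 'a-b c d-e' ranges."""
    runs = []
    for page in pages:
        if runs and page == runs[-1][1] + 1:
            runs[-1][1] = page           # extend the current run
        else:
            runs.append([page, page])    # start a new run
    return " ".join(
        str(first) if first == last else "%d-%d" % (first, last)
        for first, last in runs
    )
```

With a file handle opened on the .fpg file, `to_ranges(read_figure_pages(handle))` yields a string that can be placed after `cat` in a pdftk call such as `pdftk in.pdf cat <ranges> output figures.pdf`. Several figures on one page produce duplicate lines, which the set removes.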

score 10 · Answer 3 · edited Apr 10 '13 at 15:47

There's a rather useful python script at http://homepages.inf.ed.ac.uk/imurray2/code/hacks/pdfcolorsplit which uses pdftk to split into colour and b&w files, though it doesn't deal with the boxes around hyperrefs. If you have access to the LaTeX source, why not turn off the colour in hyperref anyway - I do it like this:

\usepackage[colorlinks=true,
            linkcolor=black,
            citecolor=black,
            filecolor=black,
            urlcolor=black]{hyperref}

IIRC if you just set [colorlinks=false] they're not clickable.

answered Apr 10 '13 at 15:14 · Chris H · edited Apr 10 '13 at 15:47 · Sean Allred

In fact, setting colorlinks=false only means that links are framed instead of colored. (Oh! and Welcome to TeX.SX!) – Sean Allred Apr 10 '13 at 15:41
Although I can't tell from the script itself: does this separate channels or pages? – Sean Allred Apr 10 '13 at 15:47
'colorlinks=false' only means that links are framed instead of colored.

OK - I had forgotten the reason why I did this.

And thanks for the welcome.

It splits pages, depending on the options, you can assemble 1 PDF of colour pages, and another of B+W, and it can handle double sided. It's not my script, but it's been very useful for theses round here - colour pages cost ~6x as much to have printed.

I've an idea to do something based on that for printing only the pages (of e.g. a scanned document) with more than some threshold of non-white - where white is also thresholded.
– Chris H Apr 10 '13 at 17:55

score 6 · Answer 4 · edited May 11 '16 at 16:54

This extends Chris H's answer.

I extended the pdfcolorsplit.py script with an option -r that reassembles all split parts into a final PDF, converting all b/w parts to grayscale before reassembling.

Use it like this (the -p option worked best):

./pdfcolorsplit.py -p -v -s -r Report.pdf

The code is here:

#!/usr/bin/env python
# Python 2 and 3 compatible.

# Python program to take a pdf file, and split it into color and black
# and white part(s). Requires pdftk and one of gs and pdftoppm.
#
# Iain Murray, February 2010.
#
# Inspired by dvicoloursplit.py, Jeremy Sanders 2001, although written
# from scratch.
#
# 2011-09-19 fixed bug with odd numbers of pages reported by Richard Shaw
# 2012-06-11 tweaked to run in Python 3 as well as 2.

##  This program is free software; you can redistribute it and/or modify
##  it under the terms of the GNU General Public License as published by
##  the Free Software Foundation; either version 2 of the License, or
##  (at your option) any later version.

##  This program is distributed in the hope that it will be useful,
##  but WITHOUT ANY WARRANTY; without even the implied warranty of
##  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##  GNU General Public License for more details.

import os, os.path, sys, string, re, tempfile, shutil, getopt
import heapq

def a2b(x):
    """Turn ascii into bytes for Python 3, in way that works with Python 2"""
    try:
        return bytes(x)
    except TypeError:  # Python 3: bytes(str) needs an explicit encoding
        return bytes(x, 'ascii')

def iscolorppm(filename):
    """Does the PPM file contain any non-grayscale colors?"""
    file = open(filename, 'rb')
    # Ugly: I read the whole file into RAM, and copy it needlessly a lot
    data = file.read()
    file.close()

    # PPM is a *very* liberal file format. It allows comments anywhere in the
    # header, even in the middle of tokens.
    comments_re = re.compile(a2b('^([^ \t\n]*)#[^\n]*\n'))
    split_re = re.compile(a2b('^([ \t\n]|#[^\n]*\n)+([^ \t\n#])'))
    tok_re = re.compile(a2b('^([^ \t\n]*)([ \t\n].*)'), re.DOTALL)
    toks = []
    while len(toks) < 4:
        while split_re.match(data):
            data = split_re.sub(r'\2', data)
        while comments_re.match(data):
            data = comments_re.sub(r'\1', data)
        (tok, data) = tok_re.match(data).groups()
        toks.append(tok)
    magic = toks[0]
    (width, height, max_color) = map(int, toks[1:])
    data = data[1:]

    if magic == b'P3':
        binary = False
    elif magic == b'P6':
        binary = True
    else:
        print("%s is not a valid PPM file" % filename)
        sys.exit(1)

    # Massage data so adjacent triples should have the same value in b/w images
    data_len = width*height*3
    if binary:
        if int(max_color) > 255:
            # Untested. Each intensity is in two bytes.
            data_len *= 2
            data = data[1:data_len:2] + data[:data_len:2]
    else:
        data = [int(x) for x in data.split()]

    if len(data) < data_len:
        print('PPM file is truncated?')
        sys.exit(1)

    triples = zip(data[0:data_len:3], data[1:data_len:3], data[2:data_len:3])
    black_and_white = all((a==b and a==c for (a,b,c) in triples))
    return not black_and_white


def pdfcolorsplit(file, doublesided, merge, use_pdftoppm, reassemble, verbose):
    # Work out which pages are color
    if verbose:
        print('Analyzing %s...' % file)
    tmpdir = tempfile.mkdtemp(prefix = 'pdfcs_')
    if use_pdftoppm:
        root = os.path.join(tmpdir, 'page')
        os.system('pdftoppm -r 20 "%s" "%s"' % (file, root))
    else:
        gs_opts = '-sDEVICE=ppmraw -dBATCH -dNOPAUSE -dSAFE -r20'
        if not verbose:
            gs_opts += ' -q'
        os.system('gs ' + gs_opts + ' -sOutputFile="%s" "%s"' \
                % (os.path.join(tmpdir, 'tmp%06d.ppm'), file))
    PPMs = os.listdir(tmpdir)
    PPMs.sort()
    iscolor = [iscolorppm(os.path.join(tmpdir, x)) for x in PPMs]
    num_pages = len(iscolor)
    shutil.rmtree(tmpdir)
    if doublesided:
        # Treat as color those b/w pages that share a sheet with a color page
        iscolorpair = [x or y for (x,y) in zip(iscolor[::2], iscolor[1::2])]
        iscolor[:2*len(iscolorpair):2] = iscolorpair
        iscolor[1::2] = iscolorpair

    # Construct page range strings
    flips = [x for x in range(2,num_pages+1) if iscolor[x-1] != iscolor[x-2]]
    if not flips:
        if verbose:
            print('No splitting needs to be done, skipping %s' % file)
        return
    edges = [1] + flips + [num_pages+1]
    ranges = ['%d-%d' % (x,y-1) for (x,y) in zip(edges[:-1], edges[1:])]

    print(iscolor, ranges)

    # Finally output split files
    if verbose:
        print('Outputing splits as new pdf files...')
    base_name = file
    if base_name.lower().endswith('.pdf'):
        base_name = base_name[:-4]
    suffixes = ['_bwsplit', '_colorsplit']
    # jobs is a seq of (range, filename) pairs, e.g. ('1-3', 'colorbits.pdf')
    # convert jobs
    if merge:
        jobs = ((' '.join(ranges[0::2]), base_name + suffixes[iscolor[0]]),\
                (' '.join(ranges[1::2]), base_name + suffixes[not iscolor[0]]))
    else:
        jobs = [(r, '%s_%03d%s' % (base_name,n+1,suffixes[(n+iscolor[0])%2])) \
                for (n,r) in enumerate(ranges)]



    for (pages, name) in jobs:
        if verbose:
            print('pdftk "%s" cat %s output "%s.pdf"' % (file, pages, name))
        os.system('pdftk "%s" cat %s output "%s.pdf"' % (file, pages, name))

    # reassemble all continuous files into final output by converting b/w parts to grayscale
    if reassemble:
      graySuffix = "_gray"
      jobsconvert = [ j for j in jobs[ int(iscolor[0])::2] ]
      #print(jobsconvert)
      # convert all b/w to gray
      for (pages,name) in jobsconvert:
        cmd="gs  -sOutputFile=%s%s.pdf  -sDEVICE=pdfwrite  -dAutoRotatePages=/None -sColorConversionStrategy=Gray  -dProcessColorModel=/DeviceGray  -dCompatibilityLevel=1.4  -dNOPAUSE  -dBATCH %s.pdf" % (name,graySuffix,name)
        if verbose:
           print(cmd) 
        os.system(cmd)

      ## interleave converted b/w and colors and make pdftk cat command
      cJobs = jobs[0::2] if iscolor[0] else jobs[1::2]
      #print(cJobs)
      bwJobs = [ (pages,name+graySuffix) for pages,name in jobsconvert]

      def interleave(l1, l2):
        iter1 = iter(l1)
        iter2 = iter(l2)
        while True:
            try:
                if iter1 != None:
                    yield next(iter1)
            except StopIteration:
                iter1 = None
            try:
                if iter2 != None:
                    yield next(iter2)
            except StopIteration:
                iter2 = None
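            if iter1 is None and iter2 is None:
                # Use return, not raise StopIteration: since PEP 479 (Python 3.7+)
                # raising StopIteration inside a generator becomes a RuntimeError.
                return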


      jobsCatAll = interleave(cJobs,bwJobs) if iscolor[0] else interleave(bwJobs,cJobs)
      #print(list(jobsCatAll))

      cmd = "pdftk " + " ".join([j[1]+".pdf" for j in jobsCatAll]) + " cat output %s%s.pdf " % (base_name,"_all")
      if verbose:
        print(cmd)  
      os.system(cmd)

def usage():
    progname = os.path.basename(sys.argv[0])
    print('Usage: %s [OPTIONS] <PDF-file(s)>' % progname)
    print('')
    print('Splits PDF files into color and black and white sections.')
    print('')
    print('Options:')
    print('   -m option merges color and b/w parts to give two files.')
    print('      The default is to output numbered contiguous pieces')
    print('      that could easily be reassembled.')
    print('   -s option chooses simplex rather than duplex output')
    print('   -v verbose.')
    print('   -p Use pdftoppm rather than gs to detect color. Faster,')
    print('      but can get confused by hyperlinks that do not print.')
    print('   -r Reassemble all continuous files by converting all b/w ')
    print('      parts to grayscale (requires gs).')

def main():
    try:
        opt_pairs, filenames = getopt.gnu_getopt(sys.argv[1:], "hvpmsr", ["help"])
    except getopt.GetoptError as err:
        print(str(err))
        usage()
        sys.exit(1)
    if opt_pairs:
        opts = list(zip(*opt_pairs))[0]
    else:
        opts = []
    if ('-h' in opts) or ('--help' in opts) or (not filenames):
        usage()
        sys.exit()
    verbose = '-v' in opts
    use_pdftoppm = '-p' in opts
    merge = '-m' in opts
    doublesided = not ('-s' in opts)

    reassemble = '-r' in opts

    if merge and reassemble:
      raise ValueError("Merge and reassemble options not compatible!")

    for file in filenames:
        pdfcolorsplit(file, doublesided, merge, use_pdftoppm, reassemble, verbose)

if __name__ == "__main__":
    main()
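The duplex handling in the script (treating a b/w page as color when it shares a sheet with a color page) is compact but easy to misread. The following standalone Python sketch shows the same idea in isolation. It is an illustration only, with a hypothetical function name, and is not a replacement for the slicing code above.

```python
def mark_sheets(iscolor):
    """Return a copy in which both sides of a duplex sheet share one flag.

    iscolor is a list of booleans, one per page, in page order. Pages are
    paired as (1, 2), (3, 4), ...; a trailing unpaired page keeps its flag.
    """
    result = list(iscolor)
    for front in range(0, len(result) - 1, 2):
        sheet_is_color = result[front] or result[front + 1]
        result[front] = result[front + 1] = sheet_is_color
    return result
```

For example, a five-page document in which only pages 2 and 5 contain color gives color sheets for pages 1-2 and for the single last page, while pages 3-4 stay b/w.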

This python script worked perfectly well. I tried other solutions, but this was the only one working for me. The other solutions are using imagemagick's command line tool identify, but this did not work with my pdf and tagged every page as being color (probably because they were colorcoded). — jespestana, May 26 '17 at 06:46

score 3 · Answer 5 · edited May 20 '19 at 12:57

Here is a MATLAB script which uses Kurt Pfeifle's answer to split a PDF into two files, one colour and one grayscale. The original file is preserved. It does not handle double-sided printing.

It is not bullet-proof and might need some debugging, but hopefully it will work out of the box.

You will need:

Ghostscript (version 9.05 and later)
MATLAB function ghostscript() and user_input() from here
pdftk

Here is the script (you will need to change pathToFile and fName, and possibly pdftkPath):

clear all; close all; clc;

%Change these:
pathToFile = '/Users/nikos/Desktop/';
fName = 'thesis.pdf';
%you might need to change the path to pdftk (if in windows for example)
pdftkPath = '/usr/local/bin/pdftk';

disp('Reminder: you might want to set \hypersetup{colorlinks=false} in latex');
disp('Do you want to manually set as grayscale any pages that have (C == M == Y)?');
a = input('Otherwise they will be treated as colour! (y/n) ','s');

if (a~= 'y' && a~='Y')
    manualMode = false;
else
    manualMode = true;
end

[status, ret] = ghostscript(['-o - -sDEVICE=inkcov ',pathToFile,fName]);

inds = strfind(ret,'0.');
pages = length(inds)/4;

if (round(pages) ~= pages)
    disp('Something went wrong');
    disp('Check the variable ret');
    disp('I am looking for the string ''0.'' which should only occur when listing CMYK values');
end

a = input(['Is your pdf ', num2str(pages), ' pages long (y/n) ?'],'s');

if (a ~= 'y' && a ~= 'Y')
    return; % stop the script; break is only valid inside a for/while loop
end

disp([num2str(pages), ' pages processed.']);
c = 1:4:length(inds);
m = 2:4:length(inds);
y = 3:4:length(inds);
k = 4:4:length(inds);

colorPages = '';
bwPages = '';
cpCounter = 0;
bwCounter = 0;
for i = 1:pages
    C = str2num(ret(inds(c(i)):inds(c(i))+6));
    M = str2num(ret(inds(m(i)):inds(m(i))+6));
    Y = str2num(ret(inds(y(i)):inds(y(i))+6));
    K = str2num(ret(inds(k(i)):inds(k(i))+6));

    if (C == 0 && M == 0 && Y == 0)
        bwPages = [bwPages, ' ',num2str(i)];
        bwCounter = bwCounter+1;
    elseif (C == M && C == Y && manualMode)
        a = input(['Is page ', num2str(i), ' colour (C == M == Y) (y/n) ?'],'s');
        if (a ~= 'y' && a ~= 'Y')
            bwPages = [bwPages, ' ',num2str(i)];
            bwCounter = bwCounter+1;            
        else
            colorPages = [colorPages, ' ', num2str(i)];
            cpCounter = cpCounter+1;
        end
    else
        colorPages = [colorPages, ' ', num2str(i)];
        cpCounter = cpCounter+1;
    end
end

cName = [pathToFile, 'color_',fName];
bName = [pathToFile, 'bw_',fName];
disp([cName, ' (',num2str(cpCounter), ' pages)']);
disp([bName, ' (',num2str(bwCounter), ' pages)']);

system([pdftkPath, ' ', pathToFile, fName, ' cat ', colorPages,' output ', cName]);
system([pdftkPath, ' ', pathToFile, fName, ' cat ', bwPages,' output ', bName]);
