Replacing whitespace with sed in a CSV (to use w/ postgres copy command)

Question

I iterate through a collection of CSV files in bash, running:

iconv --from-code=ISO-8859-1 --to-code=UTF-8 ${FILE} | \
    sed -e 's/\"//g' | \
    sed -e 's/, /,/g' \
    > ${FILE}.utf8

Running iconv to fix UTF-8 characters, then the first sed call removes the double quote characters, and the final sed call is supposed to remove leading and trailing whitespace around the commas.

HOWEVER, I still have a line like this in the saved file:

FALSE,,,, 2.40,,

The COPY command in postgres is kind of dumb, so it thinks " 2.40" is not valid syntax for a numeric value.

Where am I going wrong w/ my processing of the CSV file? Thanks!

score 2 · Answer 1 · answered Apr 14 '12 at 23:29

Probably what is happening is that you have more than one space or more than one field that has a space, and so you are seeing the results of a single successful trailing-blank substitution on a line that had more than one.

BTW, you can give multiple -e arguments to sed. Try this:

... | sed -e 's/"//g' -e 's/ *, */,/g'

score 0 · Answer 2 · answered Apr 14 '12 at 23:23

0

Your second sed only removes trailing spaces (in fact, only a single trailing space). What here is supposed to remove leading spaces?

answered Apr 14 '12 at 23:23

geekosaur

56,235
11
119
111

Forgetting leading spaces.. its not removing a single trailing space, either. – Wells Apr 14 '12 at 23:25
You don't know that; if it had *two* trailing spaces, you get one on output. That was what I meant by the parenthetical. – geekosaur Apr 14 '12 at 23:33

score 0 · Answer 3 · answered Apr 15 '12 at 05:34

Do a construct one of this to remove spaces:

sed -e ':a; s/, /,/g; ta'
sed -e 's/,[ ][ ]*/,/g'

The first one will recursively perform updates till it finds no matching combinations. Second one will search for space followed by a possible group of spaces.

And it is not necessary to remove ", as by default PostgreSQL understands them when using COPY ... WITH (FORMAT CSV)

Erwin Brandstetter · Answer 4 · 2013-10-14T07:44:02.550

A general alternative to preparing the CSV to fit the data-types exactly is to import all data as text (into a temporary table with text columns) and let the PostgreSQL type-coercion mechanisms do some of the work automatically.

In particular, leading and trailing white space is trimmed automatically when a text is cast to a numeric type or date / time type. Try:

SELECT '   234 '::text::int;
SELECT '    23.4 '::text::float8; -- incl. a leading tab
SELECT '    2012-12-1    '::text::date; -- incl. a leading & trailing tab

So if you have a table like:

CREATE TABLE foo
( id    integer,
  col_a date,
  col_b double precision
);

You can:

CREATE TEMP TABLE foo_tmp  -- dropped automatically at end of session
( id    text,
  col_a text,
  col_b text
);

COPY foo_tmp FROM '/path/to/foo_file.csv';
INSERT INTO foo
SELECT id::int
      ,col_a::date
      ,col_b::double precision
FROM   foo_tmp;

Or use any other Postgres function to prepare the string as needed.

All you need is a valid CSV format.

Replacing whitespace with sed in a CSV (to use w/ postgres copy command)

4 Answers4