0

I iterate through a collection of CSV files in bash, running:

iconv --from-code=ISO-8859-1 --to-code=UTF-8 ${FILE} | \
    sed -e 's/\"//g' | \
    sed -e 's/, /,/g' \
    > ${FILE}.utf8

Running iconv to fix UTF-8 characters, then the first sed call removes the double quote characters, and the final sed call is supposed to remove leading and trailing whitespace around the commas.

HOWEVER, I still have a line like this in the saved file:

FALSE,,,, 2.40,,

The COPY command in postgres is kind of dumb, so it thinks " 2.40" is not valid syntax for a numeric value.

Where am I going wrong w/ my processing of the CSV file? Thanks!

Wells
  • 9,693
  • 10
  • 50
  • 77

4 Answers4

2

Probably what is happening is that you have more than one space or more than one field that has a space, and so you are seeing the results of a single successful trailing-blank substitution on a line that had more than one.

BTW, you can give multiple -e arguments to sed. Try this:

... | sed -e 's/"//g' -e 's/ *, */,/g'
DigitalRoss
  • 139,415
  • 24
  • 238
  • 326
0

Your second sed only removes trailing spaces (in fact, only a single trailing space). What here is supposed to remove leading spaces?

geekosaur
  • 56,235
  • 11
  • 119
  • 111
  • Forgetting leading spaces.. its not removing a single trailing space, either. – Wells Apr 14 '12 at 23:25
  • You don't know that; if it had *two* trailing spaces, you get one on output. That was what I meant by the parenthetical. – geekosaur Apr 14 '12 at 23:33
0

Do a construct one of this to remove spaces:

sed -e ':a; s/, /,/g; ta'
sed -e 's/,[ ][ ]*/,/g'

The first one will recursively perform updates till it finds no matching combinations. Second one will search for space followed by a possible group of spaces.

And it is not necessary to remove ", as by default PostgreSQL understands them when using COPY ... WITH (FORMAT CSV)

vyegorov
  • 20,303
  • 6
  • 57
  • 72
0

A general alternative to preparing the CSV to fit the data-types exactly is to import all data as text (into a temporary table with text columns) and let the PostgreSQL type-coercion mechanisms do some of the work automatically.

In particular, leading and trailing white space is trimmed automatically when a text is cast to a numeric type or date / time type. Try:

SELECT '   234 '::text::int;
SELECT '    23.4 '::text::float8; -- incl. a leading tab
SELECT '    2012-12-1    '::text::date; -- incl. a leading & trailing tab

So if you have a table like:

CREATE TABLE foo
( id    integer,
  col_a date,
  col_b double precision
);

You can:

CREATE TEMP TABLE foo_tmp  -- dropped automatically at end of session
( id    text,
  col_a text,
  col_b text
);
COPY foo_tmp FROM '/path/to/foo_file.csv';
INSERT INTO foo
SELECT id::int
      ,col_a::date
      ,col_b::double precision
FROM   foo_tmp;

Or use any other Postgres function to prepare the string as needed.

All you need is a valid CSV format.

Erwin Brandstetter
  • 539,169
  • 125
  • 977
  • 1,137