
Maybe somebody can help me.

Starting with a CSV file like so:

Ticker,"Price","Market Cap"
ZUMZ,30.00,933.90
XTEX,16.02,811.57
AAC,9.83,80.02

I managed to read it into an array:

require 'csv'
tickers = CSV.read("stocks.csv", {:headers => true, :return_headers => true, :header_converters => :symbol, :converters => :all} )

To verify data, this works:

puts tickers[1][:ticker]
ZUMZ

However this doesn't:

puts tickers[:ticker => "XTEX"][:price]

How would I go about turning this array into a hash, using the ticker field as the unique key, so that I could easily look up any other field associatively, as defined in line 1 of the input? I'm dealing with many more columns and rows than shown here.

Much appreciated!

Marcos

6 Answers


Like this (it works with other CSVs too, not just the one you specified):

require 'csv'

tickers = {}

CSV.foreach("stocks.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
  tickers[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end

Result:

{"ZUMZ"=>{:price=>30.0, :market_cap=>933.9}, "XTEX"=>{:price=>16.02, :market_cap=>811.57}, "AAC"=>{:price=>9.83, :market_cap=>80.02}}

You can access elements in this data structure like this:

puts tickers["XTEX"][:price] #=> 16.02

Edit (in response to the comment below): For selecting elements, you can do something like

tickers.select { |ticker, vals| vals[:price] > 10.0 }
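
Multiple criteria can be combined inside the block. A small sketch against the sample data above (both thresholds are arbitrary):

# Keep tickers priced above 2.01 whose market cap also exceeds 100.0
tickers.select { |ticker, vals| vals[:price] > 2.01 && vals[:market_cap] > 100.0 }
#=> {"ZUMZ"=>{:price=>30.0, :market_cap=>933.9}, "XTEX"=>{:price=>16.02, :market_cap=>811.57}}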
Michael Kohl
  • Wow quick reply thank you!! Still getting the hang of this lang. Next I'm researching how to apply one or more filters (e.g. return this hashed array w/ all prices over 2.01) – Marcos Dec 12 '11 at 16:27
  • If this answer helped you, please upvote and/or accept (the little tick mark below the voting arrows) it, that's StackOverflow etiquette. I'll update my answer to address the filtering question :-) – Michael Kohl Dec 12 '11 at 16:30
  • My 2.8MB file with under 7000 rows and ~40 columns takes way too long on this foreach loop, over 5min, reading in only a few cols [1..4] testing in irb. Have to stick to snappy fast awk query to keep my script under 20s till I can figure this out within Ruby :( – Marcos Dec 12 '11 at 16:55
  • Dang, ain't got the reputation to upvote you...been consulting this great site for yrs but only now created acct :) – Marcos Dec 12 '11 at 16:57
  • Then just accept the answer, you should always be able to do that. As for the speed, Ruby isn't known for being fast, but 5 minutes does sound like a lot. You could try JRuby, but if you have a 20s limit, JVM startup may offset potential speed gains. But then there's nothing wrong with using `awk` for parsing CSV, it's actually very well suited for the task :-) – Michael Kohl Dec 12 '11 at 17:04
  • BTW my sample CSV datasource http://tinyurl.com/AllStocksFinviz if it helps. Thanks for all explaining & examples! – Marcos Dec 12 '11 at 17:05
  • This discussion is getting too long, but on my laptop this is how long it took me to parse the entire file (all columns): `ruby foo.rb Downloads/finviz.csv 36,00s user 1,07s system 92% cpu 40,260 total`. If you are on 1.8 you should try the `FasterCSV` gem. – Michael Kohl Dec 12 '11 at 17:27
  • try without the :converters => :all param. I'm not sure why that's there and it will likely slow it down a bit. – pguardiario Dec 12 '11 at 23:49
  • Now that my framework is maturing, and needing more fields of various types to read and process flexibly, I'm looking to shift back from my fast `awk` solution to 100% native Ruby. But it has to be just as fast and available to my scripts, so it seems `memcached` will be a good fit for me. I.e. I will structure to preload my CSV file into a memory object for Ruby to later use instantly. – Marcos Feb 25 '12 at 15:12
  • I wrote a gem that yields hashed CSV records, with no extra effort: https://rubygems.org/gems/hasherize_csv – brentiumbrent Feb 07 '13 at 16:40
require 'csv'

rows = CSV.read(file_path, headers: true, header_converters: :symbol, converters: :all).collect do |row|
  Hash[row.collect { |c, r| [c, r] }]
end
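
This yields one hash per row (with the ticker included as a field) rather than a ticker-keyed hash, so lookups scan the array:

rows.find { |r| r[:ticker] == "XTEX" }[:price] #=> 16.02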
John Bachir
Mr. Demetrius Michael

To add on to Michael Kohl's answer, if you want to access the elements in the following manner

puts tickers[:price]["XTEX"] #=> 16.02

You can try the following code snippet:

CSV.foreach("Workbook1.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
    hash_row =  row.headers[1..-1].zip( (Array.new(row.fields.length-1, row.fields[0]).zip(row.fields[1..-1])) ).to_h
    hash_row.each{|key, value| tickers[key] ? tickers[key].merge!([value].to_h) : tickers[key] = [value].to_h}
end
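
With the sample file from the question this builds the inverted structure:

{:price=>{"ZUMZ"=>30.0, "XTEX"=>16.02, "AAC"=>9.83}, :market_cap=>{"ZUMZ"=>933.9, "XTEX"=>811.57, "AAC"=>80.02}}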
clouddra

Not as one-liner-y, but this was clearer to me.

require 'csv'

csv_headers = CSV.parse(STDIN.gets)
csv = CSV.new(STDIN)

kick_list = []
csv.each do |row|
  row_hash = {}
  row.each_with_index do |field, j|
    row_hash[csv_headers[0][j]] = field
  end
  kick_list << row_hash
end
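
Feed it the CSV on standard input, e.g. `ruby script.rb < stocks.csv` (the script name is hypothetical). Since no header or field converters are used here, keys and values stay plain strings:

kick_list #=> [{"Ticker"=>"ZUMZ", "Price"=>"30.00", "Market Cap"=>"933.90"}, ...]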
Jesse Smith

To get the best of both worlds (very fast reading from a huge file AND the benefits of a native Ruby CSV object), my code has since evolved into this method:

$stock="XTEX"
csv_data = CSV.parse IO.read(%`|sed -n "1p; /^#{$stock},/p" stocks.csv`), {:headers => true, :return_headers => false, :header_converters => :symbol, :converters => :all}

# Now the 1-row CSV object is ready for use, eg:
$company = csv_data[:company][0]
$volatility_month = csv_data[:volatility_month][0].to_f
$sector = csv_data[:sector][0]
$industry = csv_data[:industry][0]
$rsi14d = csv_data[:relative_strength_index_14][0].to_f

which is closer to my original method, but only reads in one record plus line 1 of the input CSV file containing the headers. The inline sed instructions take care of that, and the whole thing is noticeably instant. This is better than the previous approach because now I can access all the fields from Ruby, and associatively, no longer caring about column numbers as I did with awk.
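
A minimal sketch of the same idea wrapped in a reusable method, assuming the ticker sits in column 1 and contains no regex metacharacters (the method name is hypothetical):

require 'csv'

# Read only the header line plus the single row matching one ticker;
# sed does the filtering before Ruby parses anything.
def stock_row(ticker, file = "stocks.csv")
  raw = IO.read(%`|sed -n "1p; /^#{ticker},/p" #{file}`)
  CSV.parse(raw, {:headers => true, :return_headers => false, :header_converters => :symbol, :converters => :all})
end

row = stock_row("XTEX")
puts row[:price][0] #=> 16.02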

Marcos
  • This isn't Perl -- you don't need the `$`, and in fact you shouldn't have it, because in Ruby, `$` indicates global variables, and use of globals is generally bad practice. – Marnen Laibow-Koser Nov 19 '12 at 06:19
  • In this case the globals are intentional. But I do realize that if the entire program were more OO, things would be written better. – Marcos Nov 19 '12 at 09:43
  • Exactly. Excessive use of globals (i.e. really *any* use except for things like configuration data) usually indicates a design problem that you'll want to fix. – Marnen Laibow-Koser Nov 19 '12 at 17:21
  • Downvoter note: This solution, crossing multiple technologies (not just Ruby), has been our chosen one for years, since the original purely-Ruby answer, while it works, is highly inefficient and simply too slow in production. – Marcos Aug 19 '14 at 17:53

While this isn't a 100% native Ruby solution to the original question, should others stumble here and wonder what awk call I wound up using for now, here it is:

$dividend_yield = IO.readlines("|awk -F, '$1==\"#{$stock}\" {print $9}' datafile.csv")[0].to_f

where $stock is the variable I had previously assigned to a company's ticker symbol (the would-be key field). It conveniently survives problems by returning 0.0 if the ticker, file, or field #9 is not found or is empty, or if the value cannot be cast to a float. So any trailing '%' in my case gets nicely truncated.

Note that at this point one could easily add more filters within awk to have IO.readlines return a one-dimensional array of output lines from the smaller resulting CSV, e.g.

awk -F, '$9 >= 2.01 && $2 > 99.99 {print $0}' datafile.csv

outputs in bash which lines have a DivYld (col 9) over 2.01 and a price (col 2) over 99.99. (Unfortunately I'm not using the header row to determine field numbers, which is where I was ultimately hoping for some searchable associative Ruby array.)
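One way to recover that associativity while keeping awk for the filtering is to resolve the field number from the header row in Ruby first. A sketch, assuming the header cell is literally "Dividend Yield" (hypothetical; use whatever line 1 of the file actually says):

require 'csv'

# Map a header name to its awk field number (awk fields are 1-based).
headers = CSV.parse_line(File.open("datafile.csv", &:gets))
col = headers.index("Dividend Yield") + 1

$dividend_yield = IO.readlines("|awk -F, '$1==\"#{$stock}\" {print $#{col}}' datafile.csv")[0].to_f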

Marcos
  • The problem with `-F,` is that `awk` treats every comma as delimiter even when it occurs inside a quoted field, like a company's name: `"Apple, Inc."` Only `"Apple` returns when I ask for field `$15` of that row. – Marcos Feb 25 '12 at 14:59