10

I have a very big CSV file (1GB+); it has 100,000 lines.

I need to write a Java program that parses each line of the CSV file and uses it to create the body of an HTTP request to send out.

In other words, I need to send out 100,000 HTTP requests, one per line of the CSV file. This would take a very long time in a single thread.

I'd like to create 1,000 threads, each of which would i) read a line from the CSV file, ii) create an HTTP request whose body contains the read line's content, and iii) send the HTTP request out and receive the response.

To do this, I need to split the CSV file into 1,000 chunks with no overlapping lines between them.

What's the best way to do such a split?

wattostudios
JuliaLi
  • *I have a very big CSV file (1GB+), it has 100,000 lines* — for nowadays' computers it isn't big at all. Having significantly more threads than CPUs is a mistake if you can saturate all the CPUs. In the end it'd be bound in the IO department; also, sending tons of concurrent requests to a server is not very wise unless you're deliberately attempting a DoS. – bestsss Jun 19 '12 at 10:29

6 Answers

12

Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).

Instead of reading the file from multiple threads, read it from a single thread and parallelize the processing of the lines. A single thread should read your CSV line by line and put each line in a queue. Multiple worker threads should then take the next line from the queue, parse it, convert it to a request, and process the request concurrently as needed. The splitting of the work is then done by a single thread, ensuring that there are no missing lines and no overlaps.
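A minimal sketch of this arrangement, assuming a BlockingQueue as the hand-off, a placeholder file name, and 10 workers; the actual HTTP call is left as a comment:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class QueuedCsvSender {
        private static final String POISON = new String("EOF"); // sentinel that stops the workers
        private static final int WORKERS = 10; // tune this; the HTTP round trip is the bottleneck

        public static void main(String[] args) throws Exception {
            final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(1000);

            // worker threads: take a line, turn it into a request, send it
            Thread[] workers = new Thread[WORKERS];
            for (int i = 0; i < WORKERS; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            String line;
                            while ((line = queue.take()) != POISON) {
                                // parse the line, build the HTTP request, send it, read the response
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
                workers[i].start();
            }

            // the single reader thread (here, main) feeds the queue
            BufferedReader in = new BufferedReader(new FileReader("input.csv"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line); // blocks when the queue is full, throttling the reader
                }
            } finally {
                in.close();
            }
            for (int i = 0; i < WORKERS; i++) {
                queue.put(POISON); // one sentinel per worker so each of them exits
            }
            for (Thread w : workers) {
                w.join();
            }
        }
    }

Because the queue is bounded, put() blocks when the workers fall behind, so the reader never holds more than a small window of the file in memory.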

Sergey Kalinichenko
  • Is it possible to do a split operation to split it into multiple chunks of the same size before reading the file? If so, after the file is split, would starting multiple threads to read the chunks in parallel be faster than a single thread reading the whole file? – JuliaLi Jun 20 '12 at 04:51
  • @JuliaLi No, not really: large files often occupy multiple blocks that are located close to each other on a disk. Since disks are much faster at accessing consecutive blocks because there is no need to re-position the magnetic head, reading a large file from disk goes much faster when it is done consecutively. – Sergey Kalinichenko Jun 20 '12 at 09:31
5

You can have a thread which reads the lines of the CSV and builds a List of the lines read. When this reaches some limit, e.g. 100 lines, pass the batch to a fixed-size thread pool to send as requests.

I suspect that unless your server has 1000 cores, you might find that using 10-100 concurrent requests is faster.
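A rough sketch of this batching approach; the batch size, worker count, and file name are placeholder values to tune by measurement:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BatchedCsvSender {
        private static final int BATCH_SIZE = 100; // lines per batch, as suggested above
        private static final int WORKERS = 10;     // in the 10-100 range suggested above

        public static void main(String[] args) throws IOException {
            ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
            BufferedReader in = new BufferedReader(new FileReader("input.csv"));
            try {
                List<String> batch = new ArrayList<String>(BATCH_SIZE);
                String line;
                while ((line = in.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == BATCH_SIZE) {
                        submit(pool, batch);
                        batch = new ArrayList<String>(BATCH_SIZE);
                    }
                }
                if (!batch.isEmpty()) {
                    submit(pool, batch); // the final, possibly partial, batch
                }
            } finally {
                in.close();
            }
            pool.shutdown(); // no new tasks; queued batches still run to completion
        }

        private static void submit(ExecutorService pool, final List<String> batch) {
            pool.submit(new Runnable() {
                public void run() {
                    for (String line : batch) {
                        // build and send the HTTP request(s) for this line (omitted)
                    }
                }
            });
        }
    }

Batching amortizes the per-task submission overhead; whether 100 lines per batch is right depends entirely on how long each request takes, so measure rather than guess.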

Leigh
Peter Lawrey
  • It depends on how long it takes to get an HTTP response. If the servers involved are slow, most of the threads will be waiting for I/O. – biziclop Jun 19 '12 at 10:22
  • If the network or the server is slow, either using larger batch sizes or more, smaller requests could improve the load time. It's impossible to say what is optimal without testing it. My point was: don't assume the more threads the better. – Peter Lawrey Jun 19 '12 at 10:26
  • That's what I meant. As your application is more likely to be I/O bound, a fixed formula based on the number of cores isn't going to work; you have to experiment with what works best. (Or write an adaptive system, which is probably overcomplicating it.) – biziclop Jun 19 '12 at 10:30
  • The server is in the same intranet as the client, and it responds fast. – JuliaLi Jun 20 '12 at 04:38
  • I found that all the answers said a single thread reading the CSV file is better than multiple ones. I understand that 1,000 threads would be worse than 10 on an 8-core computer, and that if each thread reads the file from the first line, 10 are certainly slower than 1. However, is it possible to do a split operation to split it into multiple chunks of the same size before reading the file? – JuliaLi Jun 20 '12 at 04:50
  • It is possible, but you have two problems: you need to break the data on a line boundary, or you will split some of your lines in half. Finding the line break is a bit of a pain, but you can do it (see the sketch after these comments). The second and main problem is that reading a file sequentially is almost always faster than reading a file randomly. This means that reading a file in many places at once can be much slower, unless you have about the same number of disks/spindles as threads reading concurrently (assuming you have a disk subsystem which can support concurrent reads). – Peter Lawrey Jun 20 '12 at 06:54
  • Yeah, the parallel read from disk issue is an important one too. If for example you try it on a regular Windows desktop, you're very likely to end up with a huge performance drop. Other systems will be more accommodating though. – biziclop Jun 20 '12 at 10:55
  • @biziclop A lot depends on the type of disk subsystem you have. On any desktop PC, the cheapest disks are not designed for high workloads, especially laptop drives. – Peter Lawrey Jun 20 '12 at 10:56
  • @biziclop, provided you have a proper RAID (esp. with a hardware controller) you can do multiple reads even on Windows; concurrent disk access is mostly about the hardware - an SSD is a clear winner as well. Mapping the file to memory is probably the best option for concurrent access. Still, I can't see any point in doing so, as the network/response will be the real bottleneck. – bestsss Jun 20 '12 at 13:59
  • @bestsss That's why I said "regular Windows desktop", which assumes a single no-frills box with a multi-core cpu but a single disk. Caching and buffering on the OS side can also help (or hinder) concurrent access a lot. – biziclop Jun 20 '12 at 14:18
  • I see. Thank all of you very much. – JuliaLi Jun 21 '12 at 11:08
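For completeness, here is a hedged sketch of the boundary-finding step mentioned in the comments above: seek to each approximate chunk start, then skip forward to the next line break so every chunk begins on a whole line. The class and helper names are made up for illustration, and the random-versus-sequential-read caveat from the comments still applies:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ChunkBoundaries {
        // Returns one offset per chunk, each aligned to the start of a line.
        static long[] alignedOffsets(String file, int chunks) throws IOException {
            long[] offsets = new long[chunks];
            RandomAccessFile raf = new RandomAccessFile(file, "r");
            try {
                long size = raf.length();
                offsets[0] = 0;
                for (int i = 1; i < chunks; i++) {
                    raf.seek(i * (size / chunks));     // jump to the rough chunk start
                    raf.readLine();                    // discard the partial line we landed in
                    offsets[i] = raf.getFilePointer(); // the next full line starts here
                }
            } finally {
                raf.close();
            }
            return offsets;
        }
    }

Note that RandomAccessFile.readLine() works byte-wise, which is fine for locating line breaks but not for decoding the content itself.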
2

Read the CSV file in a single thread. Once you get a line, delegate it to one of the threads available in a pool by constructing an object of your Runnable task and passing it to the ExecutorService's submit(); it will be executed asynchronously.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CsvSender {

        public static void main(String[] args) throws IOException {
            String fName = "C:\\Amit\\abc.csv";
            ExecutorService pool = Executors.newFixedThreadPool(1000);
            int count = 0; // crude throttle on concurrent requests to the server

            // BufferedReader replaces the deprecated DataInputStream.readLine()
            try (BufferedReader reader = new BufferedReader(new FileReader(fName))) {
                String thisLine;
                while ((thisLine = reader.readLine()) != null) {
                    if (count > 150) {
                        try {
                            Thread.sleep(100); // pause so the pool can drain a little
                            count = 0;
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                    pool.submit(new MyTask(thisLine));
                    count++;
                }
            }
            pool.shutdown(); // no new tasks; queued ones will still run
        }
    }

Here is your task:

    class MyTask implements Runnable {
        private final String lLine;

        public MyTask(String line) {
            this.lLine = line;
        }

        public void run() {
            // 1) create the HTTP request body from lLine
            // 2) send the HTTP request out and receive the response
        }
    }
Leigh
amicngh
2

If you're looking to unzip and parse in the same operation, have a look at https://github.com/skjolber/unzip-csv.

ThomasRS
1

Have one thread reading the file line by line, and for every line read, post a task into an ExecutorService to perform the HTTP request.

Reading the file from multiple threads isn't going to work, as in order to read the nth line, you have to read all the others first. (It could work in theory if your file contained fixed-width records, but CSV isn't a fixed-width format.)

biziclop
  • You can infer the end of the line when you know the columns; it's doable but hardly worth the effort. So if there are multiple disk arrays and a memory-mapped file, multiple threads would work (for the reading part). – bestsss Jun 19 '12 at 13:30
  • Is it possible to do a split operation to split it into multiple chunks of the same size before reading the file? If so, after the file is split, start multiple threads to read the chunks in parallel. – JuliaLi Jun 20 '12 at 04:54
0

Java 8, which is scheduled for release this month, will have improved support for this through parallel streams and lambdas. Oracle's tutorial on parallel streams might be a good starting point.
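For illustration, a minimal Java 8 sketch using Files.lines with a parallel stream; the file name and the sendRequest method are placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class ParallelCsvSender {
        public static void main(String[] args) throws IOException {
            // Files.lines reads lazily; parallel() distributes lines over the common fork-join pool
            try (Stream<String> lines = Files.lines(Paths.get("input.csv"))) {
                lines.parallel().forEach(ParallelCsvSender::sendRequest);
            }
        }

        private static void sendRequest(String line) {
            // build the HTTP request body from the line and send it (omitted)
        }
    }

The common pool defaults to roughly one thread per core; it can be capped via the java.util.concurrent.ForkJoinPool.common.parallelism system property, which matters given the caveat below.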

Note that a pitfall here is too much parallelism. For the example of retrieving URLs, it is likely a good idea to have a low number of parallel calls. Too much parallelism can affect not only bandwidth and the web site you are connecting to; you also risk running out of file descriptors, which are a strictly limited resource in most environments where Java runs.

Some frameworks that may help you are Netflix' RxJava and Akka. Be aware that these frameworks are not trivial and will take some effort to learn.

xeno