1

I'm trying to write a search program in C# which will search for a string in a large text file(5GB). I've done a simple code shown below but the search results are time consuming and can take about 30 mins to complete. This is how my code looks like-

public List<string> Search(string searchKey)
{
    List<string> results = new List<string>();
    StreamReader fileReader = new StreamReader("D:\Logs.txt");
    while ((line = fileReader.ReadLine()) != null)
    {
        if (line.Contains(searchKey)
        {
            results.Add(line);
        }
    }
}

Although the code works, it runs very slowly and takes about 30 mins to complete. Can we do something to bring the search time under a minute?

Palle Due
  • 4,918
  • 3
  • 19
  • 31
  • Does this answer your question? [Fastest way to search string in large text file](https://stackoverflow.com/questions/37979457/fastest-way-to-search-string-in-large-text-file) – styx Mar 16 '20 at 13:48
  • 2
    Are you only searching the one file for all matches for one string ? If so, it seems unlikely you can speed it up much. Still, 30 mins to search one 5GB file for all matches for one string seems really excessive. Is it running over a network connection? How many lines match? – Matthew Watson Mar 16 '20 at 13:52
  • FYI, syntactically you can make this code a little cleaner: `var results = File.ReadLines("D:\Logs.txt").Where(line => line.Contains(searchKey)).ToList();`. – maccettura Mar 16 '20 at 14:36
  • @MatthewWatson - yes, I'm searching one file for all the matches for one string. There can be between 0 to 50 matches with about 5 on an average. And yes, I'm running it over a network connection. – Dhawal Dhingra Mar 16 '20 at 14:54
  • I suggest timing how long it takes to just do the file reading part of the loop (i.e. comment-out the `if (line.Contains(searchKey)) { results.Add(line); }` part). That will give you a lower-bound on the time taken. I'm guessing you'll discover that almost all the time is taken reading from the network. – Matthew Watson Mar 16 '20 at 15:06

2 Answers2

0

For string search in a really large file, you can use the Boyer Moore Search Algorithm, which is the standard benchmark for practical string-search literature. For its implementation, links are given below:

  • I am pretty sure that ReadLines + Contains beat any hand crafted algo. https://github.com/dotnet/coreclr/blob/775003a4c72f0acc37eab84628fcef541533ba4e/src/utilcode/newapis.cpp#L577 – Drag and Drop Mar 16 '20 at 14:39
0

File indexing is implemented in the library Bsa.Search.Core

You can implement your own version of reading from a file. FileByLinesRowReader - read file by line and add documents with externalId equal line number. FileDocumentIndex has been tested on wiki data json dictionary

  • .Net Core

  • .Net 472

         var selector = new IndexWordSelector();
         var morphology = new DefaultMorphology(new WordDictionary(), selector);
         var fileName = "D:\Logs.txt";
    
         // you can implement your own file reader, csv, json, or other
         var index = new FileDocumentIndex(fileName, new FileByLinesRowReader(null), morphology);
    
         // if index is already exist we skip file indexing
             if (!index.IsIndexed)
         index.Start();
         while (!index.IsReady)
         {
             Thread.Sleep(300);
         }
    
         var query = "("one" | two) ~50 ("error*")".Parse("*");
         var found = index.Search(new SearchQueryRequest()
         {
             Field = "*",
             Query = query,
             ShowHighlight = true,
         });
         // where ExternalId is line number from file  
         //found.ShardResult.First().FoundDocs.FirstOrDefault().Value.ExternalId
    
Stanislav
  • 410
  • 3
  • 6