6

The file contains 10 million user names and passwords. I want to analyze the data, but it seems a bit unwieldy. I tried Excel 2003, but it maxed out at 65k rows, and LibreOffice Calc maxes out at about 1 million rows.

I wanted to create a pivot table once the data was loaded, but I am not having much success.

Some results I want to see:

  • Counts of the most common passwords
  • How many passwords consist only of lowercase, only of uppercase, only of numeric, or only of special characters
  • The structure of each password: which of upper, lower, numeric, and special characters it contains, and where they are placed

It seems that to perform this analysis, I need to get the data into some sort of database so that I can run functions and queries against it.

How do you analyze 2 columns with 10 million rows?

Patrick Hoefler
Sun

2 Answers

4

What are your goals? You may need to upgrade your toolset. Command-line tools such as cut, sort, and uniq, run from a Bash shell, are very useful for this kind of thing.

https://stackoverflow.com/questions/6712437/find-duplicate-lines-in-a-file-and-count-how-many-time-each-line-was-duplicated
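
If you would rather stay in one scripting language, the same "sort and count duplicates" idea can be sketched in Python. This is only a rough sketch: it assumes each line looks like `username<TAB>password`, which may not match your file, and the function name `most_common_passwords` is made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_common_passwords(lines: Iterable[str], top: int = 10,
                          sep: str = "\t") -> List[Tuple[str, int]]:
    """Count passwords in 'username<sep>password' lines, one line at a time."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        # Split on the first separator only, so a password may contain it.
        _user, found, password = line.partition(sep)
        if found:  # skip lines that have no separator at all
            counts[password] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    # Tiny in-memory sample; the tab separator is an assumption.
    sample = ["alice\t123456", "bob\tpassword", "carol\t123456"]
    print(most_common_passwords(sample, top=2))
    # prints [('123456', 2), ('password', 1)]
```

For the real file you would pass an open file object instead of a list, for example `open("passwords.txt", encoding="utf-8", errors="replace")` (the file name is a placeholder). The file is read line by line, so only the distinct passwords and their counts are kept in memory.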

If you want to analyze digit frequency, transitions between letters (n-grams), or common variations in passwords, then a toolkit such as R's natural language processing libraries will get you far:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

If you transform the dataset so that each password becomes a pattern such as LLLNL (letter-letter-letter-number-letter), you could probably gain great insight into the variation inherent in commonly used passwords.
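
As a rough illustration of that transformation, here is a small Python sketch. The helper names `password_mask` and `mask_counts`, the extra `S` symbol for anything that is neither a letter nor a digit, and the `split_case` option are my own assumptions, not part of any existing tool.

```python
from collections import Counter
from typing import Iterable


def password_mask(password: str, split_case: bool = False) -> str:
    """Map each character to L (letter), N (digit) or S (anything else).

    With split_case=True, uppercase letters become U and all other
    letters stay L, so upper/lower placement is visible too.
    """
    out = []
    for ch in password:
        if ch.isalpha():
            out.append("U" if split_case and ch.isupper() else "L")
        elif ch.isdigit():
            out.append("N")
        else:
            out.append("S")
    return "".join(out)


def mask_counts(passwords: Iterable[str], split_case: bool = False) -> Counter:
    """Count how often each mask occurs across the given passwords."""
    return Counter(password_mask(p, split_case) for p in passwords)


if __name__ == "__main__":
    print(password_mask("abc1d"))                   # prints LLLNL
    print(password_mask("Ab1!", split_case=True))   # prints ULNS
    print(mask_counts(["abc1", "xyz2", "12"]).most_common(1))
    # prints [('LLLN', 2)]
```

Counting the masks rather than the raw passwords collapses 10 million rows into a much smaller table of patterns, which is easy to sort or even paste into a spreadsheet.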

respectPotentialEnergy
1

If it is a plain-text file with plain-text passwords, PACK is a tool for analyzing such passwords. It provides several Python scripts that perform different levels of analysis, and you can run them directly from the command line. The website also provides detailed documentation that you can refer to.

Based on your requirements, I think the script statgen.py and the first part of the documentation should be sufficient. The script generates various basic statistics about the passwords.
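
To give an idea of the kind of statistic involved, here is a minimal Python sketch that sorts passwords into character-set classes. This is not PACK's code or output format; the function name `charset_class`, the class labels, and the choice of ASCII punctuation as "special" characters are all assumptions made for illustration.

```python
import string
from collections import Counter


def charset_class(password: str) -> str:
    """Classify a password by the single character set it uses, if any."""
    if not password:
        return "empty"
    # Each class is tested against ASCII sets only (an assumption).
    classes = {
        "lower": string.ascii_lowercase,
        "upper": string.ascii_uppercase,
        "numeric": string.digits,
        "special": string.punctuation,
    }
    for name, allowed in classes.items():
        if all(ch in allowed for ch in password):
            return name
    return "mixed"  # uses more than one set, or non-ASCII characters


if __name__ == "__main__":
    sample = ["password", "123456", "QWERTY", "!!!", "abc123"]
    print(Counter(charset_class(p) for p in sample))
```

Feeding every password in the file through `charset_class` and counting the labels answers the "all lower, upper, numeric, or special" question directly.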

Of course, you need a working Python environment on your computer to use it.

Yulong