
I want to work with the NSF Research Award Abstracts 1990-2003 Data Set in R. Every record of the data is stored in a separate text file, and the text files are grouped in folders (see data files here).

Is there any easy and straightforward way to load the NSF data in Python or R?

Hamideh
  • How much programming experience do you have with R? The way I would do it: iterate over the two directory levels with two for loops and read each file line by line. Merge any line that starts with a space with the previous line (adding a separator such as ,), then split each line at :. Write the data into one row or one column of a 2D array (or another 2D data structure); proceed with the next file the same way and write its data into the next row or column. I would suggest writing the data row-wise into the array (a rough Python sketch of this idea appears after these comments). – daniel.heydebreck Aug 04 '16 at 10:46
  • I don't know how similar the files are over the years, but this one example is begging for the data to be repackaged and put in a single sqlite3 database file for easier access. https://i.imgur.com/DTdDviN.png – philshem Mar 21 '17 at 12:15
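Below is a minimal sketch of the line-wise approach described in the first comment, written in Python for consistency with the answer that follows. The 'Key : value' layout with indented continuation lines is an assumption about the file format, and the directory path is a placeholder, so adjust both to match the actual files:

import os

def parse_award_file(path):
    # Parse one award file into a dict, assuming 'Key : value' lines
    # where continuation lines start with whitespace
    record = {}
    last_key = None
    with open(path, 'r', errors='replace') as fh:
        for line in fh:
            if line.startswith((' ', '\t')) and last_key is not None:
                # Continuation line: append it to the previous field
                record[last_key] += ' ' + line.strip()
            elif ':' in line:
                key, _, value = line.partition(':')
                last_key = key.strip()
                record[last_key] = value.strip()
    return record

# Walk both directory levels and collect one dict per text file
records = []
for root, dirs, files in os.walk('./Part1/'):
    for name in files:
        if name.endswith('.txt'):
            records.append(parse_award_file(os.path.join(root, name)))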

1 Answer


This is one way of working with the dataset in Python 3.5. The code and document are available here: https://github.com/HamidehIraj/ReadNSF

Import Libraries

import re
import os

Loading Data

indir = './Part1/'
# indir = './Part1-Complete/'

file_list = []
year_list = []
abstract_list = []

for root, dirs, filenames in os.walk(indir):
    for file in filenames:

        # Filtering text files
        if file.endswith('.txt'):

            # Reading the whole text file into one string for the regular expressions below
            with open(os.path.join(root, file), 'r') as log:
                raw = log.read()

            # Finding the abstract
            # Assuming that Abstract is the last item in all files
            pat_abstract = re.compile('Abstract.*', re.M | re.DOTALL)
            abstract = pat_abstract.findall(raw)

            # Converting the resulting list to a string
            abstract = ''.join(abstract)

#             print(abstract)
#             print(type(abstract))

            # Finding the term containing the start year
            # Assuming that both Start Date and Expires exist in all text files

            pat_year = re.compile('Start Date.*Expires', re.M | re.DOTALL)
            year_term = pat_year.findall(raw)

            # Converting the resulting list to a string
            year_term = ''.join(year_term)

            # Finding the start year; findall returns a list of all four-digit matches
            year = re.findall('[1-2][0-9][0-9][0-9]', year_term)

            # Converting the first match to an integer (None if no year was found)
            year = int(year[0]) if year else None

#             print(type(year))
#             print(year)

            # Creating lists for filename, year and abstract
            # filename is saved for reference
            file_list.append(file)
            year_list.append(year)
            abstract_list.append(abstract)

# print(file_list)
# print(year_list)
# print(abstract_list)

print('Number of Files')
print(len(file_list))
# print(len(year_list))
# print(len(abstract_list))

A Sample of the Data

print('FileName', 'Year', 'Abstract', sep='|')
print(file_list[0], year_list[0], abstract_list[0], sep='|')
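
If you prefer a single tabular structure over three parallel lists, they can be combined into a DataFrame. This is a minimal sketch and assumes pandas is installed (it is not used elsewhere in this answer); the variable and column names are arbitrary:

import pandas as pd

# Combine the parallel lists into one table
nsf_df = pd.DataFrame({'FileName': file_list,
                       'Year': year_list,
                       'Abstract': abstract_list})
print(nsf_df.head())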

Data Preparation

clean_list = []
punctuation = re.compile(r'[+-/\?!$%&,":;()<>@©*.|0-9]')

for element in abstract_list:

    # Remove the term "Abstract" from the abstracts
    element = element.replace('Abstract', ' ')

    # Convert to lowercase
    element = element.lower()

    # Remove punctuation and numbers from the abstracts
    element = punctuation.sub("", element)

    # Collapse multiple spaces into single spaces
    element = " ".join(element.split())

    # Add the cleaned abstract to the new list
    clean_list.append(element)

# Print the first cleaned abstract as a check
print(clean_list[0])

# Replace the original abstracts with the cleaned versions
abstract_list = clean_list
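
Following the sqlite3 suggestion in the comments on the question, the cleaned records can also be written to a single database file for easier re-use; this is a minimal sketch with an arbitrary file, table, and column naming:

import sqlite3

# Store filename, start year and cleaned abstract in one SQLite file
conn = sqlite3.connect('nsf_awards.db')
conn.execute('CREATE TABLE IF NOT EXISTS awards (filename TEXT, year INTEGER, abstract TEXT)')
conn.executemany('INSERT INTO awards VALUES (?, ?, ?)',
                 zip(file_list, year_list, abstract_list))
conn.commit()
conn.close()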
Hamideh