19

I am using a dataset to practice building a decision tree classifier.

Here is my code:

import pandas as pd 
tdf = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', sep = ',', header=0)
tdf.info()

The columns have no names, and I am having trouble adding them. I have already tried reindex, pd.melt, rename, etc.

The column names I want to assign are:

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (2 for benign, 4 for malignant)

Thanks,

Tasos
user633599

3 Answers

20

For any dataframe, say df, you can add or modify column names by assigning a list of names to the df.columns attribute. For example, if you want the column names to be ['A', 'B', 'C', 'D'], use this:

df.columns = ['A', 'B', 'C', 'D']

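In your code, header=0 tells pandas to take the first row as the column headers, so the first data row is lost. Pass header=None instead (just dropping the argument is not enough, because the default also infers the header from the first row). Once you have done that, use the above to assign the column names.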
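A minimal sketch of this approach, reading with header=None so that no data row is consumed as a header and then assigning df.columns. The three inline rows stand in for the dataset so the snippet runs without network access, and the short column names are hypothetical abbreviations of the names in the question:

```python
import io
import pandas as pd

# Three headerless sample rows in the dataset's comma-separated format
# (an inline stand-in for the real file).
raw = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
)

# header=None: no row is treated as a header; columns are labelled 0..10.
df = pd.read_csv(io.StringIO(raw), header=None)

# Hypothetical short names; the list length must match the column count.
df.columns = [
    "id", "clump_thickness", "cell_size", "cell_shape", "adhesion",
    "epithelial_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]

print(df.shape)
print(df.head())
```

Assigning a list whose length differs from the number of columns raises a ValueError, which is a quick check that no column was dropped or added by accident.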

Gyan Ranjan
11
df = pd.read_csv("Price Data.csv", names=['Date', 'Price'])

Use the names parameter of read_csv to add a header to your pandas dataframe.
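A small sketch of how names interacts with header, using in-memory CSV text rather than the Price Data.csv file (the sample values are made up for illustration). When the file has no header row, names alone is enough. When it already has one, adding header=0 replaces that row with the new names:

```python
import io
import pandas as pd

# Hypothetical sample data: one file without a header row, one with.
no_header = "2020-01-01,10.5\n2020-01-02,11.0\n"
with_header = "d,p\n2020-01-01,10.5\n2020-01-02,11.0\n"

# No header row in the file: names supplies the header, all rows are data.
a = pd.read_csv(io.StringIO(no_header), names=["Date", "Price"])

# Existing header row: header=0 marks it as the header, names overrides it.
b = pd.read_csv(io.StringIO(with_header), names=["Date", "Price"], header=0)

print(a.shape, b.shape)
print(b)
```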

Stephen Rauch
0

I tried the code above, and it is missing the first line of data.

1. Original

tdf = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', sep = ',', header=0)
tdf.shape

(698, 11)

2. As in the previous answer, removing header=0

tdf = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', sep = ',')
tdf.shape

(698, 11)

3. New answer: adding the column names while reading the CSV does get all the rows

tdf = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', sep = ',', names=['Sample code number: id number','Clump Thickness: 1 - 10','Uniformity of Cell Size: 1 - 10','Uniformity of Cell Shape: 1 - 10','Marginal Adhesion: 1 - 10','Single Epithelial Cell Size: 1 - 10','Bare Nuclei: 1 - 10','Bland Chromatin: 1 - 10','Normal Nucleoli: 1 - 10','Mitoses: 1 - 10','Class: (2 for benign, 4 for malignant)'])
tdf.shape

(699, 11)
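The same comparison can be reproduced offline. This is a sketch on three inline sample rows rather than the full 699-row file, and the helper name rows and the placeholder column names c0..c10 are made up for illustration:

```python
import io
import pandas as pd

# Three headerless sample rows in the dataset's format.
raw = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
)

def rows(**kwargs):
    """Return the number of data rows read_csv produces with these options."""
    return pd.read_csv(io.StringIO(raw), **kwargs).shape[0]

n_header0 = rows(header=0)     # first row consumed as the header
n_default = rows()             # default also infers a header from row one
n_none = rows(header=None)     # every row kept as data
n_names = rows(names=[f"c{i}" for i in range(11)])  # every row kept as data

print(n_header0, n_default, n_none, n_names)
```

The first two counts come out one lower than the last two, matching the 698 versus 699 difference seen on the real file.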

You can assign the column names when reading the CSV file:

import pandas as pd 
tdf = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', sep = ',', names=['Sample code number: id number','Clump Thickness: 1 - 10','Uniformity of Cell Size: 1 - 10','Uniformity of Cell Shape: 1 - 10','Marginal Adhesion: 1 - 10','Single Epithelial Cell Size: 1 - 10','Bare Nuclei: 1 - 10','Bland Chromatin: 1 - 10','Normal Nucleoli: 1 - 10','Mitoses: 1 - 10','Class: (2 for benign, 4 for malignant)'])

You can check the dataframe using

tdf.head()

and you get the first five rows with the assigned column names (shown as an image in the original post).

You can check the code on https://gist.github.com/e94b31914dbaebda7d11f6bfe0cfbdec

daco