Why dataframe.values is very slow

Question

I have a pandas dataframe called df. When I do the following:

A = df.values

Then when I check:

A is df.values

It returns False.

This is even though type(A) and type(df.values) both return numpy.ndarray.

Also, my df holds a mix of strings and numerical data. The call A[10,1] is significantly faster than the call df.values[10,1].

I am wondering what is happening when I call df.values?

Thanks

Andy Hayden · Answer 1 · 2014-03-04T03:02:38.743

For DataFrames of mixed type, calling .values converts multiple blocks (one for each dtype) into a single numpy array of an all-encompassing dtype. This conversion can be slow for large frames.

In [11]: pd.DataFrame([[1, 2.]]).values
Out[11]: array([[ 1.,  2.]])

In [12]: pd.DataFrame([[1, 2., 'a']]).values
Out[12]: array([[1, 2.0, 'a']], dtype=object)

An example with timings:

In [21]: df = pd.DataFrame(np.random.randn(10000))

In [22]: %timeit df.values  # no conversion
100000 loops, best of 3: 3.72 µs per loop

In [23]: df[1] = 'a'  # add column of object dtype

In [24]: %timeit df.values  # conversion to object dtype
1000 loops, best of 3: 681 µs per loop
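The dtype change behind that slowdown can be checked directly. The following is a minimal sketch that rebuilds the same frame as in the timings and compares the dtype of the full conversion with that of a single column; the variable names full and col are only illustrative.

```python
import numpy as np
import pandas as pd

# Same setup as the timing example: one float column, then one string column.
df = pd.DataFrame(np.random.randn(10000))
df[1] = "a"

# Converting the whole frame needs one dtype that fits both columns.
full = df.values
print(full.dtype)

# A single numeric column keeps its own dtype, so nothing is upcast to object.
col = df[0].values
print(col.dtype)
```

If only the numeric part is needed, pulling out that column avoids building the object array at all.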

You can see how data is stored in the BlockManager via the ._data attribute.

To answer the question: since the values attribute is computed each time it is accessed, the returned numpy array has a different id / memory address every time, and so A is df.values is False. To compare contents rather than identity, use something like numpy's array_equal:

In [31]: df.values is df.values
Out[31]: False

In [32]: np.array_equal(df.values, df.values)
Out[32]: True
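This also accounts for the speed difference in the question: A[10,1] indexes an array that already exists, while df.values[10,1] rebuilds the object array before every lookup. Here is a rough sketch of that comparison, assuming a hypothetical mixed frame with columns named x and y (they are not from the question); absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: one float column and one string column.
df = pd.DataFrame({"x": np.random.randn(10000), "y": "a"})

# Convert once and keep a reference to the resulting array.
A = df.values

# Indexing the cached array only pays for the element lookup.
cached = timeit.timeit(lambda: A[10, 1], number=1000)

# Indexing df.values repeats the conversion on every call.
rebuilt = timeit.timeit(lambda: df.values[10, 1], number=1000)

print(f"cached array: {cached:.4f} s for 1000 lookups")
print(f"df.values:    {rebuilt:.4f} s for 1000 lookups")
```

So if many elements are needed, it is usually cheaper to call .values once, keep the result, and index that.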

Thanks for your explanation. Now I understand how the .values call works under the hood. - user3377229, Mar 04 '14 at 03:18
