16

Is there a difference (in performance for example) when comparing shape and len? Consider the following example:

In [1]: import numpy as np

In [2]: a = np.array([1,2,3,4])

In [3]: a.shape
Out[3]: (4,)

In [4]: len(a)
Out[4]: 4

A quick runtime comparison suggests that there's no difference:

In [17]: a = np.random.randint(0,10000, size=1000000)

In [18]: %time a.shape
CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 13.1 µs
Out[18]: (1000000,)

In [19]: %time len(a)
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.06 µs
Out[19]: 1000000

So, what is the difference and which one is more pythonic? (I guess using shape).

Dror

5 Answers

18

I wouldn't worry about performance here - any differences should only be very marginal.

I'd say the more pythonic alternative is probably the one which matches your needs more closely:

a.shape may contain more information than len(a) since it contains the size along all axes whereas len only returns the size along the first axis:

>>> a = np.array([[1,2,3,4], [1,2,3,4]])
>>> len(a)
2
>>> a.shape
(2L, 4L)

If you actually happen to work with one-dimensional arrays only, then I'd personally favour using len(a) whenever you explicitly need the array's size.
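As a rough illustration of the point above, the following sketch compares len and shape on arrays of different dimensionality. The variable names are made up for this example; the 0-D case is an extra edge case not mentioned in the answer, included because len() is not defined there while shape still is.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])                  # 1-D array
a2 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])  # 2-D array
a0 = np.array(5)                             # 0-D array (illustrative edge case)

len_1d, shape_1d = len(a1), a1.shape  # an int vs. a one-element tuple
len_2d, shape_2d = len(a2), a2.shape  # len only sees the first axis

# len() raises TypeError on a 0-D array, while shape is simply ()
try:
    len(a0)
    zero_d_has_len = True
except TypeError:
    zero_d_has_len = False

print(len_1d, shape_1d)
print(len_2d, shape_2d)
print(zero_d_has_len, a0.shape)
```

In the 1-D case the two agree on the number, so the choice is mostly about which reads better in context.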

sebastian
9

From the pandas source code, it looks like DataFrame.shape basically uses len(): https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py

@property
def shape(self) -> Tuple[int, int]:
    return len(self.index), len(self.columns)

def __len__(self) -> int:
    return len(self.index)

Calling shape computes both dimensions, so df.shape[0] + df.shape[1] may be slower than len(df.index) + len(df.columns). Still, performance-wise, the difference should be negligible, except perhaps for a very large 2D DataFrame.

So, in line with the previous answers, df.shape is good if you need both dimensions; for a single dimension, len() seems more appropriate conceptually.

Looking at the property-vs-method answers, it all comes down to usability and readability of code. So again, in your case, I would say: if you want information about the whole DataFrame, just to check it or, for example, to pass the shape tuple to a function, use shape. For a single column, or for the index (i.e. the rows of a df), use len().
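A minimal sketch of the relationship described in this answer, using a small made-up DataFrame (the column names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical example frame: 3 rows, 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows = len(df)          # same as len(df.index)
n_cols = len(df.columns)  # number of columns
shape = df.shape          # both dimensions as one tuple

print(n_rows, n_cols, shape)
```

When only the row count matters, len(df) says so directly; df.shape is handy when the whole tuple is passed on or unpacked.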

Bish
1

There really is a (very small) difference. If you work with time-series data and know that the data is a vector (1D), use len since it is faster, and make it a habit, even if the gain is only marginal. Bish's answer already explained what happens behind the scenes.

A proper benchmark using %%timeit (I ran it several times) shows len as the winner:

# tested on pandas DataFrame

%%timeit
len(yhat.values)
# 576 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
yhat.values.shape[0]
# 607 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Furthermore, for 1D data, len (as in length) is more informative when reading code than .shape[0].
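For anyone who wants to repeat the comparison outside IPython, here is a sketch using the standard timeit module. The array `values` is a stand-in for yhat.values (an assumption, since the original data is not shown), and the loop count is kept small so the script runs quickly; absolute timings will vary by machine and library version.

```python
import timeit

import numpy as np

# Stand-in for yhat.values; the real data is not available here
values = np.arange(1_000_000)

def best_time(stmt, number=100_000, repeat=5):
    """Return the best per-call time in seconds for the given statement."""
    runs = timeit.repeat(stmt, globals={"values": values},
                         repeat=repeat, number=number)
    return min(runs) / number

t_len = best_time("len(values)")
t_shape = best_time("values.shape[0]")

print(f"len(values):     {t_len * 1e9:.1f} ns per call")
print(f"values.shape[0]: {t_shape * 1e9:.1f} ns per call")
```

Either way, both calls take well under a microsecond, so the difference only matters inside very tight loops.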

  • 1
    got very similar results, len() function is marginally faster. https://twitter.com/pfedprog/status/1499894398032744450?s=20&t=tOd_np7pKpB4rFt8tN85Ow – Pavel Fedotov Mar 05 '22 at 00:03
0

In the 1D case, len and shape both report the same size. In other cases, shape provides more information. Which one performs better can vary from program to program, so I suggest not worrying much about performance.

Ashiq Imran
  • 1
    Try: `len(np.array([0,2])) , type(np.array([0,2]).shape)` . `len` returns an int, `shape` returns a tuple of ints. This is important if actually using results in code rather than inspecting by eye – Mark_Anderson Jun 19 '20 at 19:39
0
import numpy as np

x = np.linspace(1, 10, 10).reshape((5, 2))
print(x)
print(x.size)
print(len(x))

gives the following output:

[[ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]
 [ 7.  8.]
 [ 9. 10.]]
10
5
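The output above can be tied back to shape with a short sketch: size is the product of all entries in shape, while len only reports the first entry. The use of math.prod assumes Python 3.8 or later.

```python
import math

import numpy as np

x = np.linspace(1, 10, 10).reshape((5, 2))

total_elements = x.size  # number of elements across all axes
first_axis = len(x)      # length of the first axis only

# size is the product of the shape entries; len is shape[0]
size_matches_shape = total_elements == math.prod(x.shape)
len_matches_shape = first_axis == x.shape[0]

print(x.shape, total_elements, first_axis)
```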