How to iterate over rows in a DataFrame in Pandas

Question

I have a DataFrame from Pandas:

import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

Output:

   c1   c2
0  10  100
1  11  110
2  12  120

Now I want to iterate over the rows of this frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. For example:

for row in df.rows:
    print(row['c1'], row['c2'])

Is it possible to do that in Pandas?

I found this similar question. But it does not give me the answer I need. For example, it is suggested there to use:

for date, row in df.T.iteritems():

or

for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.

The df.iteritems() iterates over columns and not rows. Thus, to make it iterate over rows, you have to transpose (the "T"), which means you change rows and columns into each other (reflect over diagonal). As a result, you effectively iterate the original dataframe over its rows when you use df.T.iteritems() — Stefan Gruenwald, Dec 14 '17 at 23:41
In contrast to what cs95 says, there are perfectly fine reasons to want to iterate over a dataframe, so new users should not feel discouraged. One example is if you want to execute some code using the values of each row as input. Also, if your dataframe is reasonably small (e.g. less than 1000 items), performance is not really an issue. — oulenz, Oct 16 '19 at 08:53
@cs95 It seems to me that dataframes are the go-to table format in Python. So whenever you want to read in a csv, or you have a list of dicts whose values you want to manipulate, or you want to perform simple join, groupby or window operations, you use a dataframe, even if your data is comparatively small. — oulenz, Nov 16 '19 at 12:19
@cs95 No, but this was in response to "using a DataFrame at all". My point is that this is why one may have one's data in a dataframe. If you then want to e.g. run a script for each line of your data, you have to iterate over that dataframe. — oulenz, Nov 16 '19 at 18:55
I second @oulenz. As far as I can tell `pandas` is the go-to choice for reading a csv file even if the dataset is small. It's simply easier programming to manipulate the data with APIs — F.S., Nov 18 '19 at 21:29
If you are a beginner to this thread and are not familiar with the pandas library, it's worth taking a step back and evaluating whether iteration is _indeed_ the solution to your problem. In some cases, it is. In most cases, it isn't. My post below introduces beginners to the library by easing them into the concept of vectorization so they know the difference between writing "good code", versus "code that just works" - and also know when to use which. Some folks are happy with the latter, they can continue to upvote @oulenz comment as much as they like. — cs95, Feb 25 '21 at 01:10
I need to generate a list of US states in two letters + population. What better way than iterating over my df and using 'print'? — user1854182, Apr 26 '21 at 05:41
use `df.apply`. For more info, see https://www.geeksforgeeks.org/apply-function-to-every-row-in-a-pandas-dataframe/ — Pavindu, May 26 '21 at 14:50

score 4504 · Accepted Answer · edited Jan 26 '22 at 23:10

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    print(row['c1'], row['c2'])

10 100
11 110
12 120
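
Since the question asks what the row object actually is, here is a minimal sketch that inspects it. The names `seen` and `first_row` are illustrative and not part of the original answer. Each `row` is a `pandas.Series` whose index holds the column names, which is why `row['c1']` works.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

seen = []  # (row label, c1 value, c2 value) for every row
for index, row in df.iterrows():
    # `index` is the row label, `row` is a Series keyed by column name
    seen.append((index, int(row['c1']), int(row['c2'])))

first_row = df.iloc[0]            # the same kind of object iterrows() yields
print(type(first_row).__name__)   # Series
print(list(first_row.index))      # ['c1', 'c2']
print(seen)
```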

answered May 10 '13 at 07:07 by waitingkuo · edited Jan 26 '22 at 23:10 by gcamargo

Note: "Because iterrows returns a Series for each row, it **does not** preserve dtypes across the rows." Also, "You **should never modify** something you are iterating over." According to [pandas 0.19.1 docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html) – viddik13 Dec 07 '16 at 16:24
@viddik13 that's a great note, thanks. Because of that I ran into a case where numerical values like `431341610650` were read as `4.31E+11`. Is there a way to preserve the dtypes? – Aziz Alto Sep 05 '17 at 16:30
@AzizAlto use `itertuples`, as explained below. See also http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html – Axel Sep 07 '17 at 11:45
How does the row object change if we don't use the index variable while iterating? Do we have to use row[0], row[1] instead of the labels in that case? – Prateek Agrawal Oct 05 '17 at 18:22
Do not use iterrows. Itertuples is faster and preserves data type. [More info](https://stackoverflow.com/a/41022840/4180797) – James L. Dec 01 '17 at 16:14
if you don't need to preserve the datatype, iterrows is fine. @waitingkuo's tip to separate the index makes it much easier to parse. – beep_check May 03 '18 at 16:55
From [the documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#iteration): "Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed[...]". Your answer is correct (in the context of the question) but does not mention this anywhere, so it isn't a very good one. – cs95 May 28 '19 at 05:00
This is the most copied answer on stackoverflow according to https://stackoverflow.blog/2021/04/19/how-often-do-people-actually-copy-and-paste-from-stack-overflow-now-we-know/ – pgp1 Aug 08 '21 at 13:34
@pgp1 It is indeed, and we can see why. It is very concise and effective :) – LiteApplication Sep 30 '21 at 14:31
It might not be recommended for speed reasons, but this way the index, the headers and the values become available in the loop without extra coding. In a list comprehension, this would take extra code: `[row for row in zip(df.index, df.to_numpy())]` and also create a dictionary with the mapping of each integer row number to the header name (`row[0]` is then mapped to the first column header, `row[1]` to the second, and so forth), but this is quite some effort if you just have this iterrows() / itertuples() / iteritems at hand and speed does not matter so much. Upvote although it is slow. – questionto42standswithUkraine Dec 07 '21 at 20:20

score 1876 · Answer 2 · edited Mar 19 '22 at 15:58

How to iterate over rows in a DataFrame in Pandas?

Answer: DON'T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().
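
As a small illustration of the printing case, a sketch using the question's sample frame (the `index=False` argument is just one choice here, not something the answer prescribes):

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Render the whole frame as one string instead of printing row by row
text = df.to_string(index=False)
print(text)
```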

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla for loop)
4. DataFrame.apply(): i) Reductions that can be performed in Cython, ii) Iteration in Python space
5. DataFrame.itertuples() and iteritems()
6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
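
To make the contrast concrete, here is a minimal sketch on the question's sample data. The variable names are illustrative. The loop and the vectorized expressions compute the same things, but the vectorized forms operate on whole columns at once.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Row-by-row version: one Python-level step per row
looped = [row['c1'] + row['c2'] for _, row in df.iterrows()]

# Vectorized versions: one expression over whole columns
total = df['c1'] + df['c2']    # arithmetic
is_big = df['c2'] > 105        # comparison
col_sum = df['c1'].sum()       # reduction
```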

If none exists, feel free to write your own using custom Cython extensions.

Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
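
For instance, a runnable instance of the `zip` pattern might look like this. The frame and the `label` function are made up for illustration and stand in for whatever business logic you have.

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['Ada', 'Alan'],
    'last': ['Lovelace', 'Turing'],
    'year': [1815, 1912],
})

def label(first, last, year):
    """Hypothetical business logic: build a display label for one row."""
    return f"{last}, {first} ({year})"

# One call per row, without creating a Series for each row
df['label'] = [label(f, l, y) for f, l, y in zip(df['first'], df['last'], df['year'])]
```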

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
When dealing with mixed data types, you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy(), as the latter implicitly upcasts data to the most common type. As an example, if A is integer and B is float, to_numpy() will cast the entire array to float, and a numeric/string mix becomes an object array, which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround to this.
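
A quick way to see the upcasting for yourself, as a sketch assuming one integer column and one float column (the sample values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0.5, 1.5, 2.5]})

arr = df[['A', 'B']].to_numpy()
print(arr.dtype)  # float64: the integers in A were upcast

# Do the values of A still arrive as integers inside the loop?
numpy_keeps_ints = all(isinstance(a, (int, np.integer)) for a, _ in arr)
zip_keeps_ints = all(isinstance(a, (int, np.integer)) for a, _ in zip(df['A'], df['B']))
print(numpy_keeps_ints, zip_keeps_ints)  # False True
```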

* Your mileage may vary for the reasons outlined in the Caveats section above.

An Obvious Example

Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
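
If you want to try this on your own data, a rough sketch of such a comparison could look like the following. Only the names `vec` and `vec_numpy` come from the text above. The other function names, the frame size and the number of repetitions are assumptions, and this is not the original benchmarking code.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 100, 1_000), 'B': rng.integers(0, 100, 1_000)})

def iter_rows(df):
    return pd.Series([row['A'] + row['B'] for _, row in df.iterrows()], index=df.index)

def iter_tuples(df):
    return pd.Series([row.A + row.B for row in df.itertuples(index=False)], index=df.index)

def list_comp(df):
    return pd.Series([a + b for a, b in zip(df['A'], df['B'])], index=df.index)

def vec(df):
    return df['A'] + df['B']

def vec_numpy(df):
    return pd.Series(df['A'].to_numpy() + df['B'].to_numpy(), index=df.index)

# Timings depend on machine, pandas version and data, so none are quoted here
for fn in (iter_rows, iter_tuples, list_comp, vec, vec_numpy):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__:12s} {seconds:.4f} s for 3 runs")
```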

My Personal Opinion*

Most of the analyses performed on the various alternatives to the iter family have been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows) and performance will come second to simplicity/readability of the solution.

Here is my personal preference when selecting a method to use for a problem.

For the novice:

Vectorization (when possible); apply(); List Comprehensions; itertuples()/iteritems(); iterrows(); Cython

For the more experienced:

Vectorization (when possible); apply(); List Comprehensions; Cython; itertuples()/iteritems(); iterrows()

Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.

I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply, as explained in this post of mine.

Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will rarely need to write code with pandas that demands a level of performance that even a list comprehension cannot satisfy.

* As with any personal opinion, please take with heaps of salt!

Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Note that there are important caveats with `iterrows` and `itertuples`. See [this answer](https://stackoverflow.com/a/41022840/3844376) and [pandas docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#iteration) for more details. — viddik13, May 30 '19 at 11:56
This is the only answer that focuses on the idiomatic techniques one should use with pandas, making it the best answer for this question. Learning to get the **right** *answer with the* **right** *code* (instead of the **right** *answer with the* **wrong** *code* - i.e. inefficient, doesn't scale, too fit to specific data) is a big part of learning pandas (and data in general). — LinkBerest, May 30 '19 at 14:26
I think you are being unfair to the for loop, though, seeing as they are only a bit slower than list comprehension in my tests. The trick is to loop over `zip(df['A'], df['B'])` instead of `df.iterrows()`. — Imperishable Night, Jun 24 '19 at 00:58
Ok, I get what you're saying, but if I need to print each row (with numeric data) of a table, sorted ascending - I guess there is no other way but to loop through the rows, right? — sdbbs, Nov 20 '19 at 13:57
@sdbbs there is, use sort_values to sort your data, then call to_string() on the result. — cs95, Nov 20 '19 at 15:37
Under List Comprehensions, the "iterating over multiple columns" example needs a caveat: `DataFrame.values` will convert every column to a common data type. `DataFrame.to_numpy()` does this too. Fortunately we can use `zip` with any number of columns. — David Wasserman, Jan 16 '20 at 20:44
@DavidWasserman that's a fantastic remark, thanks for your comments. Indeed that is something to watch out for with mixed columns unless you convert to object first (which, why would you)! — cs95, Jan 16 '20 at 20:52
Interesting, since `iterrows`, `apply` and list comprehension all seem to tend towards *O(n)* scalability I'd avoid any micro-optimisations and go with the most readable. A dataset too slow with any method is more likely in need of time spent finding solution that isn't *Pandas*, rather than trying to shave milliseconds off a `for`-loop. — c z, Jan 29 '20 at 18:00
@cz the plot is logarithmic. The difference in perf for larger datasets is in order of seconds and minutes, not milliseconds. — cs95, Jan 29 '20 at 19:08
I know I'm late to the answering party, but if you convert the dataframe to a numpy array and then use vectorization, it's even faster than pandas dataframe vectorization, (and that includes the time to turn it back into a dataframe series). For example: `def np_vectorization(df):` `np_arr = df.to_numpy()` `return pd.Series(np_arr[:,0] + np_arr[:,1], index=df.index)` And... `def just_np_vectorization(df):` `np_arr = df.to_numpy()` `return np_arr[:,0] + np_arr[:,1]` — bug_spray, Mar 24 '20 at 05:24
@AndreRicardo why not post that in an answer where it becomes more visible? — cs95, Mar 24 '20 at 06:01
This is actually what I was having a hard time finding going down the google path described in the answer. Thanks for it! — Mike_K, May 11 '20 at 02:05
Unfortunately some of us don't have an option to follow your suggestion. Because some libraries just force use of DataFrame, unnecessarily. (I got here trying to iterate over a parquet file in Python without Spark and transform the data to JSON. And I'm forced to use DataFrame) If you write libraries - please remember to not push Pandas on us. — Aleksandr Panzin, May 20 '20 at 23:20
@Dean I get this response quite often and it honestly confuses me. It's all about forming good habits. "My data is small and performance doesn't matter so my use of this antipattern can be excused" ..? When performance actually does matter one day, you'll thank yourself for having prepared the right tools in advance. — cs95, Jul 26 '20 at 04:46
The code examples following "The formula is simple,..." are all about iterating over a column, or a hard-coded group of columns. Some of us want to do one-operation-per-row of all the data in the row, i.e. using all the numbers and text (e.g. filenames), without having to hard-code every single column name. Editing the code example to make it clearer how to do *entire rows* would improve this answer. — sh37211, Jul 29 '21 at 20:18
@sh37211 the "iterating over multiple types" case can be broadened to address that requirement: `result = [f(row[0], ..., row[n]) for row in df.to_numpy()]` should just work, and no hardcoding required. Is this not what you want? — cs95, Aug 01 '21 at 19:02
Are there opportunities to improve e.g. dask with pandas in selecting from amongst these methods of almost-actual-vectorization? (FWIW, for *distributed* pandas with *dask*, there exist `cudf.DataFrame.applymap()` and `dask.dataframe.DataFrame.map_partitions()` https://docs.rapids.ai/api/cudf/nightly/user_guide/10min.html#Applymap ) — Wes Turner, Jan 08 '22 at 16:31
That is a very good explanation. Nevertheless, someone may need to use data (e.g. URLs) from the dataframe to download files. In that case, iterrows() is the way forward. — katamayros, Jan 20 '22 at 15:17
I hate this dumbing down. If your data gets big enough, just use scala. Use all the for loops you want. — Geoff Langenderfer, Apr 04 '22 at 07:35
@GeoffLangenderfer "change the language" seems like using a jackhammer to kill a mosquito. Imagine forcing yourself to learn a new language because you need to satisfy your insatiable hunger to write for loops :D — cs95, Apr 05 '22 at 07:02
Hi @cs95 What would you recommend in my case: I have a bunch of geographical coordinates (ranging from a few thousand to millions) which I need to loop through. For each coordinate, I need to open an X amount of GIS-raster subimages around the coordinate point, do processing for these subimages with N custom 3rd party functions (e.g. Gabor filters etc.) and then save these results. I assume my process is not vectorizable so my only option is to use `iterrows` here? Or, I would need to make a highly custom code in order to do this with vectorization? — jjepsuomi, May 05 '22 at 11:11
I am trying to print a statement with values in first two columns if the third column is True, for each row; I can't seem to see anything but iterators/for loop to work; at 1.4.2 the `iterrows` is still here to stay, I guess the library does think there are valid scenarios for them? — stucash, May 10 '22 at 15:03

score 517 · Answer 3 · edited Apr 30 '22 at 18:45

First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.

If you still need to iterate over rows, you can use methods below. Note some important caveats which are not mentioned in any of the other answers.

DataFrame.iterrows()

  for index, row in df.iterrows():
      print(row["c1"], row["c2"])

DataFrame.itertuples()

  for row in df.itertuples(index=True, name='Pandas'):
      print(row.c1, row.c2)

itertuples() is supposed to be faster than iterrows()

But be aware, according to the docs (pandas 0.24.2 at the moment):

iterrows: dtype might not match from row to row

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()
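
A minimal sketch of this dtype caveat, using an assumed frame with one integer and one float column (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'account_id': [431341610650, 431341610651], 'score': [1.5, 2.5]})

# iterrows(): the row is one Series, so both values share a single dtype
_, series_row = next(df.iterrows())
print(series_row.dtype)          # float64

# itertuples(): each field keeps the type of its own column
tuple_row = next(df.itertuples(index=False))
print(tuple_row.account_id)      # 431341610650
```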

iterrows: Do not modify rows

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2, axis=1)
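
The "do not modify rows" warning can be checked with a small sketch. The frame below is an assumption chosen to have mixed dtypes, which is the case where the row is a copy. Writing to the frame itself, rather than to the row object, is what takes effect.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'c1': [10, 11]})

for _, row in df.iterrows():
    row['c1'] = row['c1'] * 2    # changes only the temporary row object

unchanged = df['c1'].tolist()    # still [10, 11] for this mixed-dtype frame

# Assign to the frame itself instead
df['c1'] = df['c1'] * 2
doubled = df['c1'].tolist()      # [20, 22]
```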

itertuples:

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
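
A short sketch of the renaming behaviour, with a made-up column name that is not a valid Python identifier:

```python
import pandas as pd

df = pd.DataFrame({'my col': [1, 2], 'valid': [3, 4]})

row = next(df.itertuples())
print(row._fields)   # ('Index', '_1', 'valid')
print(row._1)        # value of 'my col', reachable only by its positional name

# name=None skips the namedtuple entirely and yields regular tuples
plain = next(df.itertuples(index=False, name=None))
print(plain)
```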

See pandas docs on iteration for more details.

answered Dec 07 '16 at 16:41 by viddik13 · edited Apr 30 '22 at 18:45 by George

Just a small question from someone reading this thread so long after its completion: how does df.apply() compare to itertuples in terms of efficiency? – Raul Guarini Jan 26 '18 at 13:16
Note: you can also say something like `for row in df[['c1','c2']].itertuples(index=True, name=None):` to include only certain columns in the row iterator. – Brian Burns Jun 29 '18 at 07:29
Instead of `getattr(row, "c1")`, you can use just `row.c1`. – viraptor Aug 13 '18 at 06:20
I am about 90% sure that if you use `getattr(row, "c1")` instead of `row.c1`, you lose any performance advantage of `itertuples`, and if you actually need to get to the property via a string, you should use iterrows instead. – Noctiphobia Aug 24 '18 at 10:34
When I tried this it only printed the column values but not the headers. Are the column headers excluded from the row attributes? – Marlo Dec 06 '18 at 05:39
I have stumbled upon this question because, although I knew there's split-apply-combine, I still *really needed to iterate* over a DataFrame (as the question states). Not everyone has the luxury to improve with `numba` and `cython` (the same docs say that "It’s always worth optimising in Python first"). I wrote this answer to help others avoid (sometimes frustrating) issues as none of the other answers mention these caveats. Misleading anyone or telling "that's the right thing to do" was never my intention. I have improved the answer. – viddik13 May 30 '19 at 12:32
And what if I want to loop through a dataframe with a step size greater than 1, e.g. select only every 3rd row? Thank you – Confounded Dec 16 '19 at 17:36
@Confounded A quick google reveals that you can use iloc to preprocess the dataframe: `df.iloc[::5, :]` will give you each 5th row. See [this question](https://stackoverflow.com/questions/25055712/pandas-every-nth-row) for more details. – viddik13 Dec 16 '19 at 22:39
FYI the 'pandas docs on iteration' link is broken. – David Doria Jun 18 '21 at 12:50
I don't know why, but using `name=None` make `itertuples` 50% faster in my use case. – Muhammad Yasirroni Dec 05 '21 at 07:13

score 236 · Answer 4 · edited Dec 11 '19 at 18:42

You should use df.iterrows(), though iterating row-by-row is not especially efficient since a Series object has to be created for each row.

answered May 24 '12 at 14:24 by Wes McKinney · edited Dec 11 '19 at 18:42 by cs95

Is this faster than converting the DataFrame to a numpy array (via .values) and operating on the array directly? I have the same problem, but ended up converting to a numpy array and then using cython. – vgoklani Oct 07 '12 at 12:26
@vgoklani If iterating row-by-row is inefficient and you have a non-object numpy array then almost surely using the raw numpy array will be faster, especially for arrays with many rows. you should avoid iterating over rows unless you absolutely have to – Phillip Cloud Jun 15 '13 at 21:06
I have done a bit of testing on the time consumption for df.iterrows(), df.itertuples(), and zip(df['a'], df['b']) and posted the result in the answer of another question: http://stackoverflow.com/a/34311080/2142098 – Richard Wong Dec 16 '15 at 11:41

score 176 · Answer 5 · edited Jun 01 '16 at 09:00

While iterrows() is a good option, sometimes itertuples() can be much faster:

import pandas as pd
from numpy.random import randint, randn

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})

%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop

edited Jun 01 '16 at 09:00 · answered Sep 20 '15 at 13:52 by e9t

Much of the time difference in your two examples seems like it is due to the fact that you appear to be using label-based indexing for the .iterrows() command and integer-based indexing for the .itertuples() command. – Alex Sep 20 '15 at 17:00
For a finance-data-based dataframe (timestamp and 4x float), itertuples is 19.57 times faster than iterrows on my machine. Only `for a,b,c in izip(df["a"],df["b"],df["c"]):` is almost equally fast. – harbun Oct 19 '15 at 13:03
Can you explain why it's faster? – Abe Miessler Jan 10 '17 at 22:05
@AbeMiessler `iterrows()` boxes each row of data into a Series, whereas `itertuples()` does not. – miradulo Feb 13 '17 at 17:30
Note that the order of the columns is actually indeterminate, because `df` is created from a dictionary, so `row[1]` could refer to any of the columns. As it turns out though the times are roughly the same for the integer vs the float columns. – Brian Burns Nov 05 '17 at 17:29
@jeffhale the times you cite are exactly the same, how is that possible? Also, I meant something like row.iat[1] when I was referring to integer-based indexing. – Alex Sep 28 '18 at 21:57
@Alex that does look suspicious. I just reran it a few times and itertuples took 3x longer than iterrows. With pandas 0.23.4. Will delete the other comment to avoid confusion. – jeffhale Sep 28 '18 at 23:33
Then running on a much larger DataFrame, more like a real-world situation, itertuples was 100x faster than iterrows. Itertuples for the win. – jeffhale Sep 28 '18 at 23:40
I get a >50 times increase as well https://i.stack.imgur.com/HBe9o.png (while changing to attr accessor in the second run). – Ajasja Nov 07 '18 at 20:53

score 126 · Answer 6 · edited Jan 08 '22 at 22:42

You can use the df.iloc indexer as follows:

for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])

answered Sep 07 '16 at 12:56 by PJay · edited Jan 08 '22 at 22:42 by Tonechas

I know that one should avoid this in favor of iterrows or itertuples, but it would be interesting to know why. Any thoughts? – rocarvaj Oct 05 '17 at 14:50
This is the only valid technique I know of if you want to preserve the data types, and also refer to columns by name. `itertuples` preserves data types, but gets rid of any name it doesn't like. `iterrows` does the opposite. – Ken Williams Jan 18 '18 at 19:22
Spent hours trying to wade through the idiosyncrasies of pandas data structures to do something simple AND expressive. This results in readable code. – Sean Anderson Sep 19 '18 at 12:13
While `for i in range(df.shape[0])` might speed this approach up a bit, it's still about 3.5x slower than the iterrows() approach above for my application. – Kim Miller Dec 14 '18 at 18:18
On large DataFrames this seems better, as `my_iter = df.itertuples()` takes double the memory and a lot of time to copy it. Same for `iterrows()`. – Bastiaan Jan 03 '19 at 22:07

score 117 · Answer 7 · answered Jun 01 '15 at 06:24

You can also use df.apply() to iterate over rows and access multiple columns for a function.

docs: DataFrame.apply()

def valuation_formula(x, y):
    return x * y * 0.5

df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
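
For completeness, a self-contained version of the snippet above with assumed sample data (the `x` and `y` values are made up). Because this particular formula is plain arithmetic, it can also be called on whole columns, which avoids the row-wise `apply` altogether.

```python
import pandas as pd

df = pd.DataFrame({'x': [2.0, 4.0], 'y': [3.0, 5.0]})   # assumed sample data

def valuation_formula(x, y):
    return x * y * 0.5

# Row-wise: the lambda receives one row (a Series) at a time
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

# Column-wise: the same function works on entire Series at once
df['price_vectorized'] = valuation_formula(df['x'], df['y'])
```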

answered Jun 01 '15 at 06:24 by cheekybastard

Does df['price'] refer to a column name in the data frame? I am trying to create a dictionary with unique values from several columns in a csv file. I used your logic to create a dictionary with unique keys and values and got an error stating **TypeError: ("'Series' objects are mutable, thus they cannot be hashed", u'occurred at index 0')** – SRS Jul 01 '15 at 17:55
**Code:** df['Workclass'] = df.apply(lambda row: dic_update(row), axis=1) **end of line** id = 0 **end of line** def dic_update(row): if row not in dic: dic[row] = id id = id + 1 – SRS Jul 01 '15 at 17:57
Never mind, I got it. Changed the function call line to **df_new = df['Workclass'].apply(same thing)** – SRS Jul 01 '15 at 19:06
Having the axis default to 0 is the worst – zthomas.nc Nov 29 '17 at 23:58
Notice that `apply` doesn't "iteratite" over rows, rather it applies a function row-wise. The above code wouldn't work if you really *do* need iterations and indeces, for instance when comparing values across different rows (in that case you can do nothing but iterating). – gented Apr 04 '18 at 13:44
@gented ...where did you see the word "iteratite" here? – cs95 Jun 29 '19 at 20:54
this is the appropriate answer for pandas – dhruvm Jul 25 '20 at 20:14

score 60 · Answer 8 · edited Jun 11 '20 at 13:43

How to iterate efficiently

If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.

In short:

As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3).
Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or '-'. See point (2).
It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4).
Only use iterrows() if you cannot use the previous solutions. See point (1).

Different methods to iterate over rows in a Pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
    print(df)

1) The usual iterrows() is convenient, but damn slow:

import time

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

3) itertuples() with name=None is even faster, but not really convenient, as you have to define a variable per column:

start_time = time.perf_counter()
result = 0
for _, col1, col2, col3, col4 in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, itertuples() with positional lookup through df.columns.get_loc() is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange:

start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples
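To make the caveat in point (2) concrete, here is a small sketch of what itertuples() does with a column name that is not a valid Python identifier. The frame and its column names are made up for illustration; the point is only that the named tuple renames such a field positionally, while name=None yields plain tuples.

```python
import pandas as pd

# Hypothetical frame: the first column name is not a valid Python identifier.
df_odd = pd.DataFrame({"My Col-Name": [1, 2], "B": [3, 4]})

# Named tuples: the invalid name is replaced by a positional field name.
first_named = next(df_odd.itertuples())
print(first_named._fields)  # ('Index', '_1', 'B')

# Plain tuples (name=None): the index comes first, then the column values in order.
first_plain = next(df_odd.itertuples(name=None))
print(first_plain)
```

So `row.B` still works here, but the strange column is only reachable as `row._1` or by position.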

So WHY are these inefficient methods available in Pandas in the first place - if it's "common knowledge" that iterrows and itertuples should not be used - then why are they there, or rather, why are those methods not updated and made more efficient in the background by the maintainers of Pandas? — Monty, Jan 05 '22 at 07:23
@Monty, it's not always possible to vectorize all operations. — Romain Capron, Jan 05 '22 at 15:34

score 47 · Answer 9 · edited Jun 11 '20 at 13:37

I was looking for How to iterate on rows and columns and ended here so:

for i, row in df.iterrows():
    for j, column in row.items():  # Series.iteritems() was removed in pandas 2.0
        print(column)

edited Jun 11 '20 at 13:37 · Peter Mortensen

answered Jan 17 '18 at 09:41 · Lucas B

When possible, you should avoid using iterrows(). I explain why in the answer [How to iterate efficiently](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/59413206#59413206) – Romain Capron Jul 20 '20 at 09:00

piRSquared · Answer 10 · 2017-11-07T04:29:57.287

You can write your own iterator that implements namedtuple

from collections import namedtuple
from timeit import timeit

import numpy as np
import pandas as pd

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)

This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.

For the given dataframe with my function:

list(myiter(df))

[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with pd.DataFrame.itertuples:

list(df.itertuples(index=False))

[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

A comprehensive test
We test making all columns available and subsetting the columns.

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);

For people who don't want to read the code: the blue line is `itertuples`, the orange line is a list of an iterator through a yield block. `iterrows` is not compared. — James L., Dec 01 '17 at 16:06

Pedro Lobito · Answer 11 · 2017-04-04T20:46:53.023

To loop over all rows in a dataframe, you can use:

for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])

edited Apr 04 '17 at 20:46

answered Mar 11 '17 at 22:44 · Pedro Lobito

This is chained indexing. I do not recommend doing this. – cs95 Apr 18 '19 at 23:20
@cs95 What would you recommend instead? – Pedro Lobito Apr 19 '19 at 01:42
If you want to make this work, call df.columns.get_loc to get the integer index position of the date column (outside the loop), then use a single iloc indexing call inside. – cs95 Apr 19 '19 at 01:57
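The suggestion in the last comment can be sketched as follows. The contents of date_example are a made-up stand-in, since the answer does not show that frame.

```python
import pandas as pd

# Hypothetical stand-in for the date_example frame used in the answer.
date_example = pd.DataFrame({"Date": ["2017-03-11", "2017-03-12"], "Value": [1, 2]})

# Look up the integer position of the column once, outside the loop.
date_pos = date_example.columns.get_loc("Date")

dates = []
for x in range(len(date_example)):
    # A single iloc call with (row position, column position), no chained indexing.
    dates.append(date_example.iloc[x, date_pos])

print(dates)
```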

score 20 · Answer 12 · edited May 07 '19 at 06:37

for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])

edited May 07 '19 at 06:37 · cs95

answered Nov 02 '17 at 10:33 · Grag2015

how is the performance of this option when used on a large dataframe (millions of rows for example)? – Bazyli Debowski Sep 10 '18 at 12:41
Honestly, I don’t know exactly, I think that in comparison with the best answer, the elapsed time will be about the same, because both cases use "for"-construction. But the memory may be different in some cases. – Grag2015 Oct 25 '18 at 13:52
This is chained indexing. Do not use this! – cs95 Apr 18 '19 at 23:19

Sachin · Answer 13 · 2022-01-06T09:46:42.817

There are multiple ways to do this, and many people have already shared their answers.

I found the two methods below easy and efficient:

Example:

 import pandas as pd
 inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
 df = pd.DataFrame(inp)
 print (df)

 #With iterrows method 

 for index, row in df.iterrows():
     print(row["c1"], row["c2"])

 #With itertuples method

 for row in df.itertuples(index=True, name='Pandas'):
     print(row.c1, row.c2)

Note: itertuples() is supposed to be faster than iterrows()

edited Jan 06 '22 at 09:46

answered Nov 24 '21 at 12:39 · Sachin

This actually answers the question. +1 – Joe Coder Dec 02 '21 at 07:11

score 14 · Answer 14 · edited Apr 13 '19 at 23:06

Sometimes a useful pattern is:

# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)

Which results in:

{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}

bug_spray · Answer 15 · 2021-08-27T05:47:45.280

Update: cs95 has updated his answer to include plain numpy vectorization. You can simply refer to his answer.

cs95 shows that Pandas vectorization far outperforms other Pandas methods for computing stuff with dataframes.

I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization (and that includes the time to turn it back into a dataframe series).

If you add the following functions to cs95's benchmark code, this becomes pretty evident:

def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:,0] + np_arr[:,1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:,0] + np_arr[:,1]
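cs95's benchmark code is not reproduced in this answer, so here is a minimal usage sketch on a small made-up frame. It only checks that the NumPy route gives the same values as plain Pandas column addition; it says nothing about timing.

```python
import numpy as np
import pandas as pd

# Small made-up frame; this is not the benchmark frame.
df_small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def np_vectorization(df):
    # Same function as above, repeated so the sketch runs on its own.
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:, 0] + np_arr[:, 1], index=df.index)

pandas_sum = df_small["A"] + df_small["B"]  # Pandas vectorization
numpy_sum = np_vectorization(df_small)      # NumPy vectorization, wrapped back into a Series

print(numpy_sum.tolist())
```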

edited Aug 27 '21 at 05:47

answered Mar 24 '20 at 17:57 · bug_spray

How did you plot this? – wwnde Aug 29 '21 at 00:20
[cs95's benchmarking code, for your reference](https://gist.github.com/Coldsp33d/948f96b384ca5bdf6e8ce203ac97c9a0/revisions) – bug_spray Sep 01 '21 at 02:40

score 12 · Answer 16 · edited Apr 21 '21 at 16:42

In short

Use vectorization if possible
If an operation can't be vectorized - use list comprehensions
If you need a single object representing the entire row - use itertuples
If the above is too slow - try swifter.apply
If it's still too slow - try a Cython routine
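As a rough illustration of the first three options in the list above, the same row-wise sum can be written in each style. The frame is the one from the question; swifter and Cython are left out because they need extra packages.

```python
import pandas as pd

df_demo = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

# 1) Vectorized: operate on whole columns at once.
vec = (df_demo["c1"] + df_demo["c2"]).tolist()

# 2) List comprehension over the zipped columns.
comp = [a + b for a, b in zip(df_demo["c1"], df_demo["c2"])]

# 3) itertuples: one object per row, fields accessed by column name.
tup = [row.c1 + row.c2 for row in df_demo.itertuples(index=False)]

print(vec, comp, tup)
```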

Benchmark

edited Apr 21 '21 at 16:42 · Peter Mortensen

answered Jun 01 '20 at 16:22 · artoby

Herpes Free Engineer · Answer 17 · 2018-04-24T08:48:05.130

To loop all rows in a dataframe and use values of each row conveniently, namedtuples can be converted to ndarrays. For example:

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

Iterating over the rows:

for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))

results in:

[ 1.   0.1]
[ 2.   0.2]

Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.

score 11 · Answer 18 · answered Oct 17 '19 at 15:26

There is a way to iterate through rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass the index as a list for the row to be returned as a DataFrame:

for i in range(len(df)):
    row = df.iloc[[i]]

Note the usage of double brackets. This returns a DataFrame with a single row.
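A quick sketch of the difference, using a small made-up frame: single brackets give a Series, double brackets give a one-row DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({"c1": [10, 11], "c2": [100, 110]})

as_series = df_demo.iloc[0]    # single brackets: the row comes back as a Series
as_frame = df_demo.iloc[[0]]   # double brackets: the row comes back as a DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame.shape)
```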

answered Oct 17 '19 at 15:26 · Zeitgeist

This was very helpful for getting the nth largest row in a data frame after sorting. Thanks! – Jason Harrison Dec 03 '19 at 05:23

Hossein Kalbasi · Answer 19 · 2020-02-28T17:51:44.053

For both viewing and modifying values, I would use iterrows(). In a for loop, using tuple unpacking (see the example: i, row), I use row only for viewing values, and I use i with the loc method when I want to modify them. As stated in previous answers, you should not modify something you are iterating over.

for i, row in df.iterrows():
    df_column_A = df.loc[i, 'A']
    if df_column_A == 'Old_Value':
        # Write through df.loc; assigning to the local variable would not change df
        df.loc[i, 'A'] = 'New_value'

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value', it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
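The difference can be checked with a small sketch. The frame below is made up and has mixed dtypes, a case where the row handed out by iterrows() is a copy.

```python
import pandas as pd

# Made-up frame with mixed dtypes (strings and integers).
df_demo = pd.DataFrame({"A": ["Old_Value", "keep"], "n": [1, 2]})

# Attempt 1: assign to the row object. Only the temporary Series changes.
for i, row in df_demo.iterrows():
    if row["A"] == "Old_Value":
        row["A"] = "New_Value"
untouched = df_demo["A"].tolist()

# Attempt 2: use the index label with loc. This writes to the DataFrame itself.
for i, row in df_demo.iterrows():
    if row["A"] == "Old_Value":
        df_demo.loc[i, "A"] = "New_Value"
updated = df_demo["A"].tolist()

print(untouched, updated)
```

For a simple replacement like this one, a boolean mask such as `df_demo.loc[df_demo["A"] == "Old_Value", "A"] = "New_Value"` avoids the loop entirely.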

score 7 · Answer 20 · edited Jun 11 '20 at 13:38

There are many ways to iterate over the rows of a Pandas dataframe. One very simple and intuitive way is:

df = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])

    # For printing more than one column
    print(df.iloc[i, [0, 2]])

edited Jun 11 '20 at 13:38 · Peter Mortensen

answered Jan 19 '19 at 06:53 · shubham ranjan

score 7 · Answer 21 · answered Nov 02 '20 at 21:35

The easiest way is to use the apply function:

def print_row(row):
   print(row['c1'], row['c2'])

df.apply(print_row, axis=1)

answered Nov 02 '20 at 21:35 · François B.

JohnE · Answer 22 · 2021-07-23T18:29:01.630

As many answers here correctly and clearly point out, you should not generally attempt to loop in Pandas, but rather should write vectorized code. But the question remains if you should ever write loops in Pandas, and if so the best way to loop in those situations.

I believe there is at least one general situation where loops are appropriate: when you need to calculate some function that depends on values in other rows in a somewhat complex manner. In this case, the looping code is often simpler, more readable, and less error prone than vectorized code. The looping code might even be faster, too.

I will attempt to show this with an example. Suppose you want to take a cumulative sum of a column, but reset it whenever some other column equals zero:

import pandas as pd
import numpy as np

df = pd.DataFrame( { 'x':[1,2,3,4,5,6], 'y':[1,1,1,0,1,1]  } )

#   x  y  desired_result
#0  1  1               1
#1  2  1               3
#2  3  1               6
#3  4  0               4
#4  5  1               9
#5  6  1              15

This is a good example where you could certainly write one line of Pandas to achieve this, although it's not especially readable, especially if you aren't fairly experienced with Pandas already:

df.groupby( (df.y==0).cumsum() )['x'].cumsum()

That's going to be fast enough for most situations, although you could also write faster code by avoiding the groupby, but it will likely be even less readable.
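To make the one-liner easier to read, it can be split into its two steps. This sketch uses the same data as above; the name group_key is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [1, 1, 1, 0, 1, 1]})

# Step 1: every zero in y starts a new group; the running count of zeros is the group key.
group_key = (df["y"] == 0).cumsum()
print(group_key.tolist())  # [0, 0, 0, 1, 1, 1]

# Step 2: cumulative sum of x, restarted within each group.
result = df.groupby(group_key)["x"].cumsum()
print(result.tolist())     # [1, 3, 6, 4, 9, 15]
```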

Alternatively, what if we write this as a loop? You could do something like the following with NumPy:

import numba as nb

@nb.jit(nopython=True)  # Optional
def custom_sum(x,y):
    x_sum = x.copy()
    for i in range(1, len(x)):  # use the argument's length, not the global df
        if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
    return x_sum

df['desired_result'] = custom_sum( df.x.to_numpy(), df.y.to_numpy() )

Admittedly, there's a bit of overhead there required to convert DataFrame columns to NumPy arrays, but the core piece of code is just one line of code that you could read even if you didn't know anything about Pandas or NumPy:

if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]

And this code is actually faster than the vectorized code. In some quick tests with 100,000 rows, the above is about 10x faster than the groupby approach. Note that one key to the speed there is numba, which is optional. Without the "@nb.jit" line, the looping code is actually about 10x slower than the groupby approach.

Clearly this example is simple enough that you would likely prefer the one line of pandas to writing a loop with its associated overhead. However, there are more complex versions of this problem for which the readability or speed of the NumPy/numba loop approach likely makes sense.

score 4 · Answer 23 · edited Mar 31 '22 at 07:36

df.iterrows() returns tuple(a, b) where a is the index and b is the row.
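A short sketch of what that tuple looks like, on a made-up frame with string index labels: the second element is a Series whose labels are the column names, which is why row['c1'] works.

```python
import pandas as pd

df_demo = pd.DataFrame({"c1": [10, 11], "c2": [100, 110]}, index=["a", "b"])

# Take just the first (index, row) pair from the iterator.
first = next(df_demo.iterrows())
index_label, row = first

print(index_label)          # the index label of the row
print(type(row).__name__)   # the row itself is a Series
print(row["c1"], row["c2"])
```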

edited Mar 31 '22 at 07:36 · pythonic833

answered Jul 03 '21 at 06:58 · Ashvani Jaiswal

score 3 · Answer 24 · edited Jun 11 '20 at 13:37

You can also do NumPy indexing for even greater speed ups. It's not really iterating but works much better than iteration for certain applications.

subset = row['c1'][0:5]
all = row['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast

np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) # Resize every image in an hdf5 file

gru · Answer 25 · 2022-04-08T16:02:59.957

Disclaimer: Although there are many answers here that recommend not using an iterative (loop) approach (and I mostly agree), I would still see it as a reasonable approach for the following situation:

Extend dataframe with data from API

Let's say you have a large dataframe which contains incomplete user data. Now you have to extend this data with additional columns, for example the user's age and gender.

Both values have to be fetched from a backend API. I'm assuming the API doesn't provide a "batch" endpoint (which would accept multiple user IDs at once). Otherwise, you should rather call the API only once.

The costs (waiting time) for the network request surpass the iteration of the dataframe by far. We're talking about network roundtrip times of hundreds of milliseconds compared to the negligibly small gains in using alternative approaches to iterations.

1 expensive network request for each row

So in this case, I would absolutely prefer using an iterative approach. Although the network request is expensive, it is guaranteed to be triggered only once for each row in the dataframe. Here is an example using DataFrame.iterrows:

Example

for index, row in users_df.iterrows():
  user_id = row['user_id']
  # trigger expensive network request once for each row
  response_dict = backend_api.get(f'/api/user-data/{user_id}')
  # extend dataframe with multiple data from response
  users_df.at[index, 'age'] = response_dict.get('age')
  users_df.at[index, 'gender'] = response_dict.get('gender')
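A variation on the example above, as a sketch only: collect the responses in a list first, then assign each new column in one step. Here fake_get_user_data is a hypothetical stand-in for the backend call (the real backend_api is not shown in this answer), and the canned data is made up.

```python
import pandas as pd

users_df = pd.DataFrame({"user_id": [1, 2, 3]})

def fake_get_user_data(user_id):
    """Hypothetical stand-in for the network request; returns canned data."""
    canned = {1: {"age": 31, "gender": "f"}, 2: {"age": 45, "gender": "m"}}
    return canned.get(user_id, {})

# Still exactly one call per row, with results kept in row order.
responses = [fake_get_user_data(uid) for uid in users_df["user_id"]]

# Assign each new column once; users without data end up with missing values.
users_df["age"] = [r.get("age") for r in responses]
users_df["gender"] = [r.get("gender") for r in responses]

print(users_df)
```

The number of network requests is the same as in the iterrows version; only the bookkeeping around them changes.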

score 2 · Answer 26 · answered Mar 16 '19 at 22:33

This example uses iloc to isolate each value in the data frame.

import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

mjr = pd.DataFrame({'a':a, 'b':b})

size = mjr.shape

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])

score 2 · Answer 27 · answered Dec 10 '19 at 09:36

Some libraries (e.g. a Java interop library that I use) require values to be passed in one row at a time, for example when streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one. I wrote the class below, which comes in handy from time to time.

from typing import Dict, List

class DataFrameReader:
  def __init__(self, df):
    self._df = df
    self._row = None
    self._columns = df.columns.tolist()
    self.reset()
    self.row_index = 0

  def __getattr__(self, key):
    return self.__getitem__(key)

  def read(self) -> bool:
    self._row = next(self._iterator, None)
    self.row_index += 1
    return self._row is not None

  def columns(self):
    return self._columns

  def reset(self) -> None:
    self._iterator = self._df.itertuples()

  def get_index(self):
    return self._row[0]

  def index(self):
    return self._row[0]

  def to_dict(self, columns: List[str] = None):
    return self.row(columns=columns)

  def tolist(self, cols) -> List[object]:
    return [self.__getitem__(c) for c in cols]

  def row(self, columns: List[str] = None) -> Dict[str, object]:
    cols = set(self._columns if columns is None else columns)
    return {c : self.__getitem__(c) for c in self._columns if c in cols}

  def __getitem__(self, key) -> object:
    # the df index of the row is at index 0, so column positions are shifted by 1
    try:
        if type(key) is list:
            # a list of column names returns a list of values
            return [self._row[self._columns.index(k) + 1] for k in key]
        return self._row[self._columns.index(key) + 1]
    except Exception:
        # unknown column or no current row
        return None

  def __next__(self) -> 'DataFrameReader':
    if self.read():
        return self
    else:
        raise StopIteration

  def __iter__(self) -> 'DataFrameReader':
    return self

Which can be used:

for row in DataFrameReader(df):
  print(row.my_column_name)
  print(row.to_dict())
  print(row['my_column_name'])
  print(row.tolist(row.columns()))

This preserves the value/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.

score 1 · Answer 28 · edited Oct 31 '20 at 13:57

Along with the great answers in this post, I am going to propose a Divide and Conquer approach. I am not writing this answer to dismiss the other great answers, but to complement them with another approach that worked efficiently for me. It has two steps: splitting and merging the pandas dataframe:

PROS of Divide and Conquer:

You don't need to use vectorization or any other method that casts your dataframe into another type
You don't need to Cythonize your code, which normally takes extra time
In my case, iterrows() and itertuples() had the same performance over the entire dataframe
Depending on your choice of slicing index, you will be able to exponentially quicken the iteration. The higher the index, the quicker your iteration process.

CONS of Divide and Conquer:

The iteration over one slice shouldn't depend on a different slice of the same dataframe. That is, if you want to read from or write to another slice, it may be difficult to do that.

=================== Divide and Conquer Approach =================

Step 1: Splitting/Slicing

In this step, we are going to divide the iteration over the entire dataframe. Suppose you are going to read a CSV file into a pandas df and then iterate over it. In my case I have 5,000,000 records and I am going to split them into slices of 100,000 records.

NOTE: As the runtime analyses in the other solutions on this page explain, the "number of records" has an exponential relation to the "runtime" of a search on the df. Based on the benchmark on my data, here are the results:

Number of records | Iteration per second
========================================
100,000           | 500 it/s
500,000           | 200 it/s
1,000,000         | 50 it/s
5,000,000         | 20 it/s

Step 2: Merging

This is going to be an easy step, just merge all the written csv files into one dataframe and write it into a bigger csv file.

Here is the sample code:

# Step 1 (Splitting/Slicing)
import pandas as pd
df_all = pd.read_csv('C:/KtV.csv')
df_index = 100000
df_len = len(df_all)
for i in range(df_len // df_index + 1):
    lower_bound = i * df_index
    higher_bound = min(lower_bound + df_index, df_len)
    # splitting/slicing df (make sure to copy(), otherwise it will be a view)
    df = df_all[lower_bound:higher_bound].copy()
    '''
    write your iteration over the sliced df here
    using iterrows() or itertuples() or ...
    '''
    # writing into csv files
    df.to_csv('C:/KtV_prep_' + str(i) + '.csv')

# Step 2 (Merging)
filename = 'C:/KtV_prep_'
# reuse df_len and df_index from step 1 to rebuild the list of file names
df = (pd.read_csv(f) for f in [filename + str(i) + '.csv' for i in range(df_len // df_index + 1)])
df_prep_all = pd.concat(df)
df_prep_all.to_csv('C:/KtV_prep_all.csv')
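A related sketch, not the code I used above: pandas can do the slicing while reading, through the chunksize parameter of read_csv, which yields one DataFrame per slice. The in-memory CSV, the column names and the "doubled" computation are made up so that the example runs on its own.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the large file (made-up data).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

processed = []
# chunksize=4 makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk = chunk.copy()
    # Placeholder for the per-slice iteration.
    chunk["doubled"] = [row.value * 2 for row in chunk.itertuples(index=False)]
    processed.append(chunk)

# Merging step: concatenate the processed slices back into one frame.
merged = pd.concat(processed, ignore_index=True)
print(len(processed), len(merged))
```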

Reference:

Efficient way of iteration over datafreame

Concatenate csv files into one Pandas Dataframe

score 1 · Answer 29 · edited Jan 05 '22 at 20:48

As the accepted answer states, the fastest way to apply a function over rows is to use a vectorized function, the so-called NumPy ufuncs (universal functions).

But what should you do when the function you want to apply isn't already implemented in NumPy?

Well, using the vectorize decorator from numba, you can easily create ufuncs directly in Python like this:

from numba import vectorize, float64

@vectorize([float64(float64)])
def f(x):
    # x is a single float64 element of the input array; do something with it and return a float
    return x * 2.0  # placeholder computation, replace with your own

The documentation for this function is here: Creating NumPy universal functions

Ernesto Elsäßer · Answer 30 · 2021-08-02T10:04:04.483

Probably the most elegant solution (but certainly not the most efficient):

for row in df.values:
    c2 = row[1]
    print(row)
    # ...

for c1, c2 in df.values:
    ...  # do something with c1 and c2

Note that:

the documentation explicitly recommends using .to_numpy() instead
the produced NumPy array will have a dtype that fits all columns, in the worst case object
there are good reasons not to use a loop in the first place

Still, I think this option should be included here, as a straight-forward solution to a (one should think) trivial problem.
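The dtype point from the notes above can be seen in a small sketch. Both frames are made up; the loop uses .to_numpy() as the documentation recommends.

```python
import pandas as pd

numeric = pd.DataFrame({"c1": [10, 11], "c2": [0.5, 1.5]})
mixed = pd.DataFrame({"c1": [10, 11], "c2": ["x", "y"]})

# Integers and floats share a common numeric dtype.
print(numeric.to_numpy().dtype)
# Numbers and strings only fit into the generic object dtype.
print(mixed.to_numpy().dtype)

# Looping works the same way as with df.values.
pairs = [(c1, c2) for c1, c2 in mixed.to_numpy()]
print(pairs)
```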

score -2 · Answer 31 · edited Apr 21 '21 at 16:37

Use df.iloc[]. For example, using dataframe 'rows_df':

Or

To get values from a specific row, you can convert the dataframe into ndarray.

Then select the row and column values like this:

edited Apr 21 '21 at 16:37 · Peter Mortensen

answered Mar 04 '21 at 18:50 · dna-data

Consider not posting code in images, but as text in a code block. – Scratte Mar 07 '21 at 09:31

score -2 · Answer 32 · edited Jan 05 '22 at 20:45

A better way is to convert the dataframe into a dictionary by using zip, creating a key value pairing, and then access the row values by key.

My answer shows how to use a dictionary as an alternative to Pandas. Some people think dictionaries and tuples are more efficient. You can easily replace the dictionary with a namedtuple list.

 inp = [{'c1':10, 'c2':100}, {'c1':11, 'c2':110}, {'c1':12, 'c2':120}]
 df = pd.DataFrame(inp)
 print(df)

 for row in inp:
     for (k, v) in zip(row.keys(), row.values()):
         print(k, v)

Output:
