10

I am new to Python (no computer science background) for data science. I keep hearing that Python is easy, but I am making incremental progress. As an example, I understand:

len(titles[(titles.year >= 1950) & (titles.year <=1959)])

"In the titles dataframe, create a series and take from the year column of the titles dataframe anything greater than or equal to 1950 AND anything less than or equal to 1959. Then take the length of it."

But when I encounter the following, I don't understand the logic of:

t = titles
(t.year // 10 * 10).value_counts().sort_index().plot(kind='bar')

or

titles.title.value_counts().head(10)

In both these cases, I can piece it together obviously. But it is not clear. In the second, why does Python not allow me to use square brackets and regular brackets like in the first example?

kindall
DataNoob7
  • Off hand, this doesn't appear to be just "vanilla" python. Do you have any libraries you are using? (numpy, scipy, anaconda, etc.) If you had to run a "pip" command, that installs libraries. It would be helpful to note / tag what libraries you are using. – Mark Ribau May 15 '19 at 00:12
  • @MarkRibau Looks like `pandas`. – gmds May 15 '19 at 00:12
  • Judging from the word "dataframes," pandas is right. – kindall May 15 '19 at 00:12
  • Where would you expect to use square brackets in your other examples? – Code-Apprentice May 15 '19 at 00:13
  • You could use the brackets on `t.year` as well, you just don't. I'm not sure I understand your confusion, exactly; can you elaborate? – juanpa.arrivillaga May 15 '19 at 00:14
  • Shoot. One I'm not particularly familiar with. Offhand, it's the way pandas exposes data... they /could/ implement the ability to use brackets instead of a function call on a returned object, but that may complicate other things. – Mark Ribau May 15 '19 at 00:15
  • If I were you, I would pass bins to value_counts rather than use // – BENY May 15 '19 at 00:50

3 Answers

15

This is not about lists vs pd.Series, but rather about the function of parentheses (()) vs brackets ([]) in Python.

Parentheses are used in two main cases: to modify the order of precedence of operations, and to delimit arguments when calling functions.

The difference between 1 + 2 * 3 and (1 + 2) * 3 is obvious, and if you want to pass a and b to a function f, f a b will not work, unlike in, say, Haskell.

We are concerned mostly with the first use here; for example, in this line:

(t.year // 10 * 10).value_counts().sort_index().plot(kind='bar')

Without the parentheses, you would be calling that chain of methods on 10, which wouldn't make sense. Clearly, you want to call them on the result of the parenthesised expression.
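
As a minimal sketch of the same rule, using plain strings instead of pandas (the variable names here are illustrative and not from the question): a method call binds more tightly than an operator, so without parentheses the method applies only to the last operand.

```python
# Method calls bind more tightly than operators such as + or //.
first = "ab"
second = "cd"

# Without parentheses, .upper() applies only to `second`.
without_parens = first + second.upper()

# With parentheses, .upper() applies to the result of the whole expression.
with_parens = (first + second).upper()

print(without_parens)  # abCD
print(with_parens)     # ABCD
```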

Now, in mathematics, brackets can also be used to denote precedence, in conjunction with parentheses, in a case where multiple nested parentheses would be confusing. For example, the two may be equivalent in mathematics:

[(1 + 2) * 3] ** 4
((1 + 2) * 3) ** 4

However, that is not the case in Python: ((1 + 2) * 3) ** 4 can be evaluated, whereas [(1 + 2) * 3] ** 4 is a TypeError, since the part within brackets resolves to a list, and you can't perform exponentiation on lists.
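
The difference can be checked directly in an interpreter. This is a small sketch; the variable names are only for illustration.

```python
# Parentheses only group; the result is an ordinary number.
grouped = ((1 + 2) * 3) ** 4
print(grouped)  # 6561

# Square brackets build a list, here the one-element list [9].
as_list = [(1 + 2) * 3]
print(as_list)  # [9]

# A list does not support exponentiation, so this raises TypeError.
try:
    as_list ** 4
    raised = False
except TypeError as error:
    raised = True
    print("TypeError:", error)
```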

Rather, what happens in something like titles[titles.year >= 1950] is not directly relevant to precedence (though of course anything outside the brackets will not be part of the inner expression).

Instead, the brackets represent indexing; in some way, the value of titles.year >= 1950 is used to get elements from titles (this is done using overloading of the __getitem__ dunder method).

The exact nature of this indexing may differ; lists take integers, dicts take any hashable object and pd.Series take, among other things, boolean pd.Series (that is what is happening here), but they ultimately represent some way to subset the indexed object.
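
To make the mechanism concrete, here is a toy sketch in plain Python. `MaskableList` is a hypothetical class written for this illustration, not part of pandas, and it only imitates the idea of boolean-mask indexing; the real pandas implementation is far more involved.

```python
class MaskableList:
    """Toy container (hypothetical, not a pandas class) that accepts
    either an integer position or a list of booleans inside []."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, key):
        # A list of booleans acts as a mask: keep items whose flag is True.
        if isinstance(key, list):
            return MaskableList(v for v, keep in zip(self.values, key) if keep)
        # Anything else is passed on to the underlying list (e.g. an int).
        return self.values[key]

    def __ge__(self, threshold):
        # Comparison returns one boolean per element, like a boolean Series.
        return [v >= threshold for v in self.values]


years = MaskableList([1948, 1952, 1957, 1961])

mask = years >= 1950   # [False, True, True, True]
recent = years[mask]   # square brackets call years.__getitem__(mask)

print(mask)
print(recent.values)   # [1952, 1957, 1961]
print(years[0])        # 1948

# Built-in containers use the same bracket syntax with other key types.
print([10, 20, 30][1])  # 20
print({"a": 1}["a"])    # 1
```

The point of the sketch is that `years[mask]` and `[10, 20, 30][1]` use the same syntax; only the container's `__getitem__` decides what kinds of key are accepted.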

Semantically, therefore, we can see that brackets mean something different from parentheses, and are not interchangeable.

For completeness, using brackets as opposed to parentheses has one tangible benefit: it permits reassignment, because it automatically delegates to either __setitem__ or __getitem__, depending on whether assignment is being performed.

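Therefore, you could do something like titles[titles.year >= 1950] = 'Nothing' if you wanted. By contrast, titles(titles.year >= 1950) is a call, which would delegate to __call__, and a call can never be the target of an assignment. So titles(titles.year >= 1950) = 'Nothing' is rejected before the code even runs, and will always fail in the following way (the exact wording depends on the Python version):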

SyntaxError: can't assign to function call
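
A short sketch of that delegation, using a hypothetical `Recorder` class that only logs which dunder method Python invoked. The `compile` call is used here just to trigger the syntax check without stopping the script.

```python
class Recorder:
    """Toy class (hypothetical) that records which dunder method ran."""

    def __init__(self):
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("__getitem__", key))
        return None

    def __setitem__(self, key, value):
        self.calls.append(("__setitem__", key, value))


r = Recorder()
r["x"]              # reading: Python calls r.__getitem__("x")
r["x"] = "Nothing"  # assigning: Python calls r.__setitem__("x", "Nothing")
print(r.calls)

# Assigning to a call is rejected when the code is compiled, before anything
# runs. The exact wording of the message varies between Python versions.
try:
    compile("r('x') = 'Nothing'", "<example>", "exec")
    rejected = False
except SyntaxError as error:
    rejected = True
    print("SyntaxError:", error.msg)
```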
gmds
  • "Now, in mathematics, brackets can also be used to denote precedence, in conjunction with parentheses, in a case where multiple nested parentheses would be confusing." This might hit on the main confusion if the OP is familiar with this usage in algebra. – Code-Apprentice May 15 '19 at 00:24
  • Wow, thank you for the fulsome response. Yes, while my background in algebra is small, that has definitely caused me some confusion learning Python. This is the point - [] vs (). In particular, in Pandas. I am trying to be able to understand the logic, (hence my internal mentalese). So in Pandas, are [] for indexing and not making a series? – DataNoob7 May 15 '19 at 01:25
  • @DataNoob7 Remember that indexing creates some form of subset! So, if you have a `Series`, you can index it to get *another* `Series`. If you mean from raw data, then that's a function call - something like `pd.Series(data)`. – gmds May 15 '19 at 01:27
  • @gmds Stupid question - How is indexing defined in computer science? To me, the word "indexing" is just creating a "table of contents" to identify different parts of something. But you haven't actually extracted anything yet. So your phrase "indexing creates some form of subset" throws me off. To me you are just assigning a number, a letter, etc. to your dataframe, so you have not taken a subset of anything yet? – DataNoob7 May 15 '19 at 01:40
  • @DataNoob7 In this context, "indexing" is something you do to a *collection*, which is an object that contains a number (including one or zero) of objects. Examples of collections are `lists`, `tuples`, `dicts` and `pd.Series`. Indexing basically tells the collection to return some subset of the objects it contains, based on the arguments passed to the indexing function. For instance, for `lists`, you pass integer indices and get the elements at those positions. – gmds May 15 '19 at 03:47
  • @gmds I'm unsure where you've got the idea that _collection_ defines `__getitem__`. [Collection isn't a term in the glossary](https://docs.python.org/3/glossary.html), and `collections.abc.Collection` doesn't define `__getitem__` and so you can have a collection without the functionality you say it has. – Peilonrayz May 15 '19 at 11:01
  • @Peilonrayz I don't mean a Python term or class; it's a more general computer science concept. See [Wikipedia](https://en.wikipedia.org/wiki/Collection_(abstract_data_type)). – gmds May 15 '19 at 11:06
  • @gmds I don't see how you 'index a set'. – Peilonrayz May 15 '19 at 11:29
  • @Peilonrayz And I don't see where I said you could index a set, or that collections are defined by indexability. I said that indexing is something you do to collections i.e. it may be defined by a collection, or it may not. – gmds May 15 '19 at 11:35
5

Square brackets are used for indexes on lists and dictionaries (and things that act like these). On the other hand, parentheses are used for a variety of reasons. In this case, they are used for grouping in (t.year // 10 * 10) or as a function call in value_counts() and other places.

In the case of a library like pandas, whether you use indexing notation with [] or a function call is entirely determined by the implementation of the library. You can learn these details through tutorials and the library's documentation.

Before digging deeper into the pandas library, I suggest that you study the basics of Python syntax. The official tutorial is a good place to start.

On a side note, when you write code, do not make each line as complex as what you see in these examples. You should instead break things into smaller pieces and assign intermediate parts to variables. For example, you can take

(t.year // 10 * 10).value_counts().sort_index().plot(kind='bar')

and turn it into

decade = (t.year // 10 * 10)
counts = decade.value_counts()
sorted = counts.sort_index()
sorted.plot(kind='bar')
Code-Apprentice
  • I have to agree. If just starting out (especially if you're new to programming in general), start with python basics before jumping into the data science part w/ pandas, numpy, etc. – Mark Ribau May 15 '19 at 00:16
  • @anyone I know `sorted` is a bad name here because of the builtin function. If anyone has a better suggestion, feel free to edit. – Code-Apprentice May 15 '19 at 00:22
  • What about `sorted_counts`, or, to be more specific (and, unfortunately, verbose), `index_sorted_counts`? – gmds May 15 '19 at 00:29
  • I have started "Python basics" through other means, but I will check out the official tutorial. I have a lot of data that I need to analyze for work (many datasets ranging from 700k to 40 million), so I need to accelerate this though. The frustrating part is that I know what I have to do, but it is translating it into Python code that is very difficult. I understood [] to denote a series in pandas, but it is also indexing? Indexing outside of pandas? To the point of code length - maybe that is the issue. – DataNoob7 May 15 '19 at 01:20
  • @DataNoob7 "I understood [] to denote a series in pandas" The word "denote" is probably very misleading here and isn't the correct way to think about it. `[]` **only** indexes an object. When you index a list of integers, the result is an integer. On the other hand, when you index a dataframe, the result is a series. This is not the same as "denote". More importantly, if you think of `[]` as always indexing, it applies to all uses rather than making special cases for each usage. – Code-Apprentice May 15 '19 at 16:02
  • @DataNoob7 I suggest that you learn how `[]` works with lists and dictionaries. This will help clear up the confusion. – Code-Apprentice May 15 '19 at 16:06
  • @Code-Apprentice so indexing refers to "search and collect 'x'". Both a process and an action? I will look that difference up. Seems like there is a hierarchy - lists, then dictionaries, then series, then dataframes, etc, indexing returns the level below you are indexing. (Probably still the wrong way to think about programming haha) – DataNoob7 May 15 '19 at 21:58
2
t = titles
(t.year // 10 * 10).value_counts().sort_index().plot(kind='bar')

titles is a data frame. year is a column in that frame. In order, the operations are

  • Divide the year by 10 (integer division) and multiply by 10. This truncates the last digit to 0, so that each year is the beginning of its decade. The result of this is another column, the same length as the original.
  • Count the values; this will produce a new table with an entry (year, frequency) for each decade-year.
  • Sort this table by the default index.
  • Make a bar plot of the result.
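
The steps above can be sketched one at a time on a tiny made-up dataframe. The five rows below are invented for illustration and are not the question's data; the plotting step is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Hypothetical stand-in for the `titles` dataframe from the question.
titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year": [1951, 1958, 1963, 1947, 1969],
})
t = titles

decade = t.year // 10 * 10           # 1951 -> 1950, 1963 -> 1960, ...
counts = decade.value_counts()       # one row per decade, ordered by count
sorted_counts = counts.sort_index()  # reorder the rows by decade
print(sorted_counts)

# sorted_counts.plot(kind='bar') would then draw the bar chart.
```

Each intermediate name holds an ordinary pandas object, so it can be printed and inspected before moving on to the next step.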

Does that get you going?

Prune
  • Thanks - I can understand what the code means, it's understanding the logic of why the code is written as such (see my original post). – DataNoob7 May 15 '19 at 01:17