9

I want to use the iris dataset provided by scikit-learn for a paper. But I don't know what the standard for referencing datasets is. What citation should I use for this dataset in my paper? Should I reference scikit-learn? Ronald Fisher for having introduced the dataset? Edgar Anderson for having collected the data? All of the above?

Wolfgang
  • 16,997
  • 4
    If you're going to reference Fisher, you need to spell his name right. My suggestion: if Fisher gave a reference when he first used it, use that reference. If he didn't, reference him (I think most people do that). – Glen_b Oct 12 '14 at 22:15
  • 2
    Unless you are restricted in the number of references, there is no harm in citing both. I find I've never read the Anderson original, but I wouldn't assume he didn't analyse the data unless you have read it too. The Fisher reference is important; my wild guess is that the dataset would have faded into statistical obscurity without Fisher making it prominent. – Nick Cox Oct 13 '14 at 12:50
  • 5
    http://stats.stackexchange.com/questions/74776/what-aspects-of-the-iris-data-set-make-it-so-successful-as-an-example-teaching/74901 doesn't answer your question, but it comments on some common minor errors in working with this dataset. – Nick Cox Oct 13 '14 at 12:51
  • 2
    I would cite both papers (Anderson, 1936; Fisher, 1936), but not scikit-learn, as the dataset is simply bundled with the library and is not unique to it (for example, the same iris dataset is bundled with the R environment as well). – Aleksandr Blekh Oct 13 '14 at 13:58
  • @aleksandr Blekh - The OP is using a dataset provided by scikit-learn. The page on which the dataset appears mentions "If you use the software, please consider citing scikit-learn". Why would you not cite scikit-learn then? – martino Oct 13 '14 at 14:20
  • 5
    @martino: scikit-learn certainly has to be cited, if used. However, the OP's question was about citing the iris dataset, which calls for an independent citation. This is because the dataset is an independent entity, which is included in many software packages and is not unique to scikit-learn. (By the way, it wasn't me who downvoted your answer, in case you are curious.) – Aleksandr Blekh Oct 14 '14 at 04:44
  • 1
    @AleksandrBlekh I think your comment is the answer to my question. – usernumber Nov 27 '14 at 00:25
  • All right. Then I will submit my comment as the answer, so that you could upvote and accept it, if you wish. Always glad to help. – Aleksandr Blekh Nov 27 '14 at 04:13

2 Answers

6

I would cite both papers (Anderson, 1936; Fisher, 1936), but not scikit-learn, as the dataset is simply bundled with the library and is not unique to it (for example, the same iris dataset is bundled with the R environment as well). Having said that, scikit-learn certainly has to be cited as well, if used, but not because of the dataset.

  • It is worth mentioning that Fisher's paper "The use of multiple measurements in taxonomic problems" (1936) was published in the Annals of Eugenics. Although this paper is commonly cited for the methods employed in creating the Iris dataset, it is important to be aware of potential sensitivities that may arise, especially in academic settings where politically frustrated lecturers might be present. – Florian Fasmeyer Jun 23 '23 at 14:11
  • More context: When discussing Iris, we talk about flowers, not the human eye. – Florian Fasmeyer Jun 23 '23 at 14:13
-3

I think that citing scikit-learn is sufficient. According to the scikit-learn documentation, you should cite their paper. You can always add a reference to the Iris dataset in scikit-learn by providing a link to the page.

EDIT - I stand corrected. The accepted answer is spot on.

martino
  • 1,690
  • 4
    Part of the purpose of a citation is to allow others to find the same data source and a link would serve this purpose very well. But another purpose of citation is to provide academic credit - which in this case belongs less to the developers, but more to the researchers who collated (and perhaps popularised) the data set. It's in this second aspect I find this answer less convincing. – Silverfish Nov 27 '14 at 07:28