
I have been reviewing the paper "PNrule: A New Framework for Learning Classifier Models in Data Mining" by Agarwal & Joshi (2000) and the associated technical report. The paper outlines an approach to learning classifiers on datasets with a severely imbalanced binary outcome: a two-stage scheme that first learns P-rules aiming for high recall on the rare positive class, then learns N-rules on the resulting false positives to recover precision.
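For anyone unfamiliar with the paper, here is my own toy sketch of the two-stage idea, in case it helps frame the question. This is emphatically not the authors' algorithm: the function names are mine, a "rule" here is a single (feature == value) test over categorical data, and rule search is a naive greedy covering loop, whereas the paper uses conjunctive rules with its own growth, pruning and probabilistic scoring mechanisms.

```python
def learn_rules(X, y, target, min_precision=0.6, max_rules=5):
    """Greedy sequential covering: repeatedly pick the single
    (feature, value) test with the best precision for `target` on the
    not-yet-covered examples, then remove the examples it covers."""
    rules, remaining = [], list(range(len(X)))
    for _ in range(max_rules):
        best, best_prec = None, min_precision
        # Candidate tests come from the remaining examples of the target class.
        candidates = {(f, X[i][f]) for i in remaining if y[i] == target
                      for f in range(len(X[i]))}
        for f, v in candidates:
            covered = [i for i in remaining if X[i][f] == v]
            prec = sum(y[i] == target for i in covered) / len(covered)
            if prec > best_prec:
                best, best_prec = (f, v), prec
        if best is None:
            break
        rules.append(best)
        remaining = [i for i in remaining if X[i][best[0]] != best[1]]
    return rules

def fires(rules, x):
    return any(x[f] == v for f, v in rules)

def pnrule_fit(X, y):
    # Stage 1 (P-rules): cover as many of the rare positives as possible.
    p_rules = learn_rules(X, y, target=1)
    # Stage 2 (N-rules): within the examples the P-rules cover, learn rules
    # that pick out the false positives, restoring precision.
    idx = [i for i in range(len(X)) if fires(p_rules, X[i])]
    n_rules = learn_rules([X[i] for i in idx], [y[i] for i in idx], target=0)
    return p_rules, n_rules

def pnrule_predict(p_rules, n_rules, x):
    # Positive iff some P-rule fires and no N-rule vetoes it.
    return int(fires(p_rules, x) and not fires(n_rules, x))

# Toy example: positives mostly share a == 'p', but ('p', 'n') is negative.
X = [('p', 'x'), ('p', 'y'), ('p', 'n'), ('q', 'x'), ('q', 'y'), ('p', 'x')]
y = [1, 1, 0, 0, 0, 1]
p_rules, n_rules = pnrule_fit(X, y)
# The P-stage finds a == 'p'; the N-stage vetoes its false positive b == 'n'.
```

The interesting part, and what distinguishes it from simply stacking two classifiers, is that the second stage is trained only on the first stage's covered region, so it specialises in separating true from false alarms.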

The approach seems promising, and stands apart from the fairly homogeneous body of approaches that incorporate cost-sensitive learning and various nearest-neighbour-based methods to delete points or synthesise new ones. The authors also report excellent performance on the KDD Cup 1999 network-intrusion task, though other real-world evidence of performance is lacking.

I have been unable to find the algorithm encoded in a package for either Python or R, whether under the name of one of the original authors or otherwise. I was wondering if anyone else was aware of a relevant package. It would be good to know before starting on any coding from scratch.
Similarly, I was wondering whether anyone knew of a reason why this isn't more popular twenty years on. Perhaps the clue is that it feels more like a framework than something you can simply drop in as a preliminary step in an existing data science pipeline.

Many thanks.

PS I can see there is a stream of thought that people experiencing problems with sensitivity scores due to data imbalances are suffering from an overactive imagination. I would like to be clear that the problems I am trying to solve are real.

demim00nde
  • "Unbalanced" datasets are per se not problematic. The "problems" arise when we use KPIs like accuracy, precision or recall. What looks like "intuitive understandability" masks behavior that does not conform to this intuition (all these KPIs have the same issues). The solution is usually not to perform wilder and wilder gyrations to game accuracy etc., but to switch to probabilistic predictions and evaluate these using proper [tag:scoring-rules]. – Stephan Kolassa Jan 06 '23 at 16:30
  • @StephanKolassa I'm sure your remarks are well-addressed to all budding data scientists. I can't say I'm sure I see a connection with my question, but that's fine. – demim00nde Jan 06 '23 at 16:43
  • The connection to your question is that your question seeks clarification on a technique that aims to remedy something that turns out not to be a problem. // A meta post on class imbalance has additional links and discussion. – Dave Jan 06 '23 at 17:10
  • Thanks both. I will have a look at your links and potentially reconsider my question. – demim00nde Jan 06 '23 at 18:16
  • I would refer you to Dikran Marsupial's answer in the thread you both point to. I don't want to characterise your discussion there simplistically in a comment, but will say that my organisation has few outcomes in its minority class, lacks the ability to establish enormous datasets that would achieve low asymptotic variance where helpful, and is interested in the prediction problem. In addition, proper scoring rules and probability modelling are inappropriate even in penalised regression, and more intrinsically in SVMs and decision trees. I think you're in a bit of a bubble. – demim00nde Jan 07 '23 at 10:19
  • I have read, upvoted and awarded a bounty to Dikran's answer, because it does explain one issue with "unbalanced" data, namely that the precision of parameter estimates degrades if data are unbalanced. However, this is orthogonal to the general understanding that "class imbalance must be addressed in the context of KPIs like accuracy, precision, recall etc.", a statement which Dikran does not defend. ... – Stephan Kolassa Jan 08 '23 at 15:20
  • ... I am a bit surprised at your statement that "proper scoring rules and probability modelling are inappropriate even in penalised regression". Would you care to explain? I can also post a formal question so you can answer it; if your answer is enlightening I promise to upvote, accept and/or bounty it. If you believe we are in a bit of a bubble, I invite you to take a pointy pin to it and burst it. I have seen a lot of pushback on my position here, but nothing I have found convincing so far. I would love to learn where I am wrong. – Stephan Kolassa Jan 08 '23 at 15:22

0 Answers