
I would like to figure out why some of my company's devices fail. I have a list of around 300 devices with about 70 parameters each. Only about half of the parameters are numerical; the others are mostly ok/not ok flags or free-text comments. The latter are hard to use in an analysis, I guess? Note also that the list contains mainly devices that failed within their first year (which means our warranty kicked in ^^).
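To make the non-numerical columns usable, one option is to encode them before any analysis. The following is only a minimal sketch: the column names (`seal_check`, `comment`), the example rows, and the keyword list are all hypothetical and would need to be replaced by what the real list contains.

```python
def encode_status(value):
    """Map 'ok' / 'not ok' strings to 1 / 0; return None for anything else."""
    text = value.strip().lower()
    if text == "ok":
        return 1
    if text == "not ok":
        return 0
    return None


def keyword_flags(comment, keywords):
    """Return a 0/1 flag per keyword, set when the keyword occurs in the comment."""
    text = comment.lower()
    return {kw: int(kw in text) for kw in keywords}


# Hypothetical example rows; the real list has ~300 devices and ~70 parameters.
devices = [
    {"id": "A1", "seal_check": "ok", "comment": "Corrosion on connector"},
    {"id": "A2", "seal_check": "not ok", "comment": "no remarks"},
    {"id": "A3", "seal_check": "OK ", "comment": "Connector loose, slight corrosion"},
]

# Hypothetical keywords; in practice a process engineer would suggest these.
KEYWORDS = ["corrosion", "loose"]

encoded = []
for row in devices:
    features = {"id": row["id"], "seal_check": encode_status(row["seal_check"])}
    features.update(keyword_flags(row["comment"], KEYWORDS))
    encoded.append(features)

for features in encoded:
    print(features)
```

Plain substring matching is crude (it ignores negations such as "no corrosion"), so the resulting flags are a starting point for exploration rather than reliable features.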

I'm thinking about which methods to use for this task. At the moment I'm eyeballing the data via scatter plots and correlations. I'm aware of various methods, but I wonder which ones really make sense here, as opposed to applying methods I can hardly draw insights from. My next step would be a PCA and/or an EFA, and I already wonder which of the two makes more sense. Does a PCA really reveal which parameters have the most influence (here)? Or would an SEM be better? I've never worked with SEM, but it seems like what I'm looking for. However, it seems I have to know the so-called "latent" variable, or at least what it could be, up front, and that is exactly what I am trying to find. Could you evaluate those methods and/or suggest other useful approaches?
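For reference, here is a minimal sketch of what a PCA on the numerical parameters would compute, using NumPy's SVD on standardised data. The data matrix is made up for illustration. Note that PCA is unsupervised: the components describe directions of largest variance among the parameters and are not by themselves tied to failure.

```python
import numpy as np

# Hypothetical toy data: 6 devices x 3 numeric parameters (made-up values).
X = np.array([
    [1.0, 2.1, 10.0],
    [2.0, 3.9, 11.0],
    [3.0, 6.2, 9.0],
    [4.0, 8.1, 10.5],
    [5.0, 9.8, 9.5],
    [6.0, 12.1, 10.0],
])

# Standardise each column so parameters on large scales do not dominate.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardised data.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Variance captured by each component, and its share of the total.
explained_variance = s**2 / (len(Z) - 1)
explained_ratio = explained_variance / explained_variance.sum()

loadings = Vt        # row k holds the parameter weights of component k
scores = Z @ Vt.T    # device coordinates on the components (e.g. for a biplot)

print("explained variance ratio:", np.round(explained_ratio, 3))
print("loadings of first component:", np.round(loadings[0], 3))
```

A large loading only says that a parameter varies together with others along that component. Whether the component has anything to do with failure has to be checked separately, for example by colouring the scores by failure status or failure type.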

Ben
  • Looking at a biplot of your data where the points are coloured by whether they pass or failed is not guaranteed to be insightful, but it is a good start in an exploratory data analysis. – Galen Mar 15 '22 at 20:50
  • 1. I hope you have data for devices that fail and devices that don't; otherwise it will be easy to be misled (noticing "this happens a lot with failing devices" is no use if it happens even more with non-failing ones). 2. If you do have both, ultimately you may be looking at, say, logistic regression, but you may want to split off some data to select features or derive new ones. 3. You may be able to identify particular keywords and combinations of keywords in the text field that are likely to be indicative; some things are very likely to suggest red flags to a process engineer, I expect. – Glen_b Mar 16 '22 at 00:44
  • Thank you both, this is helping! So far, I lack information about devices without failures; I have to check whether I can get some. – Ben Mar 16 '22 at 06:29
  • Btw, in case such data are not available, do you think I could create artificial devices by drawing random parameter values from the intervals they are supposed to lie in? – Ben Mar 16 '22 at 07:56
  • @Ben, regarding your last comment: I don't think this is a good idea, as there is no guarantee whatsoever that this creates "realistic" data. You'd need a fairly reliable model of what the parameters of non-failing devices should look like, not just intervals. – Christian Hennig Mar 22 '22 at 08:46
  • If you don't have data on devices that did not fail, so that you cannot compare failed devices with those that have not failed (yet), you might still learn something if you have data about the type of failure. For instance, you mentioned that you know the time until failure. Patterns that you find there might give you some clues about the causes of failure. – Sextus Empiricus Mar 22 '22 at 09:03
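Two of the points raised in the comments can be illustrated with a small sketch: comparing how often a characteristic occurs in failed versus non-failed devices, and summarising time to failure per failure type when no non-failed devices are available. All numbers, the failure-type labels (`seal`, `board`), and the variable names below are hypothetical.

```python
from collections import defaultdict
from statistics import median


def rate(flags):
    """Share of devices in which a 0/1 characteristic is present."""
    return sum(flags) / len(flags)


# Part 1: a characteristic that "happens a lot" in failed devices
# can still be uninformative. Hypothetical flags, 1 = characteristic present.
failed = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]      # present in 8 of 10 failed devices
not_failed = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]  # present in 9 of 10 working devices

rate_failed = rate(failed)
rate_ok = rate(not_failed)
difference = rate_failed - rate_ok  # negative: even more common without failure

print(f"failed: {rate_failed:.0%}, not failed: {rate_ok:.0%}")

# Part 2: without non-failed devices, compare failed devices among themselves.
# Hypothetical (failure type, days until failure) records.
records = [("seal", 30), ("seal", 45), ("board", 200), ("board", 310), ("seal", 60)]

days_by_type = defaultdict(list)
for failure_type, days in records:
    days_by_type[failure_type].append(days)

median_days = {ftype: median(days) for ftype, days in days_by_type.items()}
print(median_days)
```

In the first part, looking at the failed devices alone would suggest the characteristic matters, while the comparison shows it is at least as common among working devices. In the second part, clearly different failure times per failure type would hint at different underlying causes, which is something that can be explored with failed devices only.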