Questions tagged [data-preprocessing]

A step of cleaning data in data mining for analysis purposes

Data preprocessing is a data mining technique that involves transforming raw data into a format that is convenient for further analysis. Issues that often arise include inconsistencies and missing values.


Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications. Data preprocessing is particularly important when you implement an artificial neural network.

519 questions
10
votes
3 answers

Automatic data cleansing

A common problem in ML is poor data quality: errors in feature values, misclassified instances, etc. One way of addressing this problem is to manually go through the data and check, but are there other techniques? (I bet there are!)…
andreister
  • 3,357
9
votes
2 answers

Creating "demo" data from real data: disguising without disfiguring

(I have no real idea what to tag this with because I'm no statistician and I don't know what field this falls into. Feel free to add more suitable tags.) I work for a company that produces data analysis software, and we need a decent set of data to…
7
votes
0 answers

Does the definition of what is considered "tidy data" differ by application?

After reading a recent paper by Hadley (link), I got to thinking about whether what we'd refer to as tidy data changes by application. For example, consider a sample dataset: Food item | Carbohydrates | Fat F1 | 10 | 12 F2 |…
Naumz
  • 171
5
votes
2 answers

Dropping data from people who have "perfect" scores

OK, so I have data from a class that had a preparatory self-test to see how prepared they were for the class, and the final results for the class. The preparatory self-test had a range from 0..13 and the final had a range from 0..100. (don't mind…
bnsh
  • 163
4
votes
2 answers

What should be done first, handling missing data or dealing with data types?

In data science, which process should come first: handling missing data or handling data types? I am asking this question because I have a problem in the following cases: 1) Handling missing data first, then handling data types - It would be difficult to…
Kiran
  • 191
2
votes
2 answers

Is it 40% or 0.4%?

A variable, which should contain percents, also contains some "ratio" values, for example: 0.61 41 54 .4 .39 20 52 0.7 12 70 82 The real distribution parameters are unknown but I guess it is unimodal with most (say over 70% of) values occurring…
Orion
  • 214
2
votes
0 answers

Strategies for dealing with near zero variance

I am trying to create a predictive model for future stock returns. At a high level, I'd like to explore the idea that the stock market is dynamic, that a predictive model should shift/evolve through time. What I've been considering is creating a…
rmacey
  • 313
2
votes
0 answers

Is there a way to express how "dirty" a data set is?

I would like to know some general parameters that can be used to describe how "dirty" the data is. Issues I am having are the following: Lots of missing values; The values of some predictors are filled in but often completely wrong; I can try to…
Kasper
  • 3,399
1
vote
0 answers

Questions about pre-processing/transformation of data

For an assignment for a ML Online course I have to find the best classifier for a given data set using 4 different methods: Logistic regression, Decision tree, Support Vectors and K Nearest Neighbours. The data was already pre-processed (for some…
ISquared
  • 129
1
vote
1 answer

How long does it take to clean data?

I am trying to plan out how long it will take me to clean my survey data. I have about 200 responses. The survey takes about 15 minutes, about 40-60 questions (depending on the logic). I have very few open-ended questions (maybe three total).…
1
vote
1 answer

Data cleaning: Derived variables

I am cleaning data that I will use with machine learning prediction algorithms. Several of the variables in my data set are sums of other variables, e.g. given variables x1, x2, x3, x3 = x1 + x2, or even x4 = x5 + x6 + ... + x10. I feel like I should remove these…
sma
  • 233
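A derived-variable relationship like the one described in this question can be checked mechanically. The following is a minimal sketch, assuming the data is held as a list of dictionaries; the helper name `is_sum_of`, the tolerance, and the sample values are hypothetical and not taken from the question. It only detects the relationship and does not settle whether such columns should be removed.

```python
def is_sum_of(rows, target, parts, tol=1e-9):
    """Return True if, in every row, `target` equals the sum of `parts` (within tol)."""
    return all(
        abs(row[target] - sum(row[p] for p in parts)) <= tol
        for row in rows
    )

# Hypothetical sample data in which x3 = x1 + x2 holds for every row.
rows = [
    {"x1": 1.0, "x2": 2.0, "x3": 3.0},
    {"x1": 4.0, "x2": 0.5, "x3": 4.5},
    {"x1": 0.0, "x2": 7.0, "x3": 7.0},
]

x3_is_derived = is_sum_of(rows, "x3", ["x1", "x2"])
x1_is_derived = is_sum_of(rows, "x1", ["x2", "x3"])
```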
1
vote
1 answer

Pre-Processing - Applied on all three (training/validation/test) sets?

From what I understand from previously answered questions, you're meant to do your pre-processing on each set after splitting your data into training and test sets. But I'm not sure where the validation set comes into this. Do I also pre-process it…
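One common convention, offered here as an assumption rather than as the accepted answer to this question, is to estimate preprocessing parameters on the training set only and then apply those same parameters to the validation and test sets. A minimal sketch using standardization, with hypothetical helper names and made-up values:

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Estimate the mean and (population) standard deviation from training data only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant feature
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously fitted parameters to any split."""
    return [(v - mu) / sigma for v in values]

# Hypothetical single feature, already split into three sets.
train = [2.0, 4.0, 6.0, 8.0]
validation = [5.0, 9.0]
test = [1.0, 7.0]

mu, sigma = fit_standardizer(train)          # parameters come from train only
train_z = standardize(train, mu, sigma)
validation_z = standardize(validation, mu, sigma)  # reuse train parameters
test_z = standardize(test, mu, sigma)              # reuse train parameters
```

In this sketch the validation set is treated exactly like the test set: it is transformed with the training parameters, never used to estimate them.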
1
vote
1 answer

For data reduction, what is this technique called?

I have observations X along with their labels Y. I then create a histogram of Y. I then remove observations such that the histogram still retains the same distinct shape. Does anyone know what this data reduction technique is called?
user46925
0
votes
0 answers

Organizing an analytical file for persons with multiple records

My stakeholders have manually recorded data for patients enrolled in an intervention, which is causing data issues that I need to resolve in order to determine the appropriate statistical approach. Each person has a unique medical ID…
0
votes
0 answers

How to process multidimensional feature in python?

Hi there, my dataset looks as follows:

Patient ID | Medicine | Death
1 | A,B,C,D,E | 1
2 | B,D | 0
3 | A,D,E | 1

So my dependent feature is death and my independent feature is medicine. I am trying to predict death based on the medication received…
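One possible way to handle a comma-separated feature like the Medicine column is to expand it into one 0/1 indicator per medicine (sometimes called multi-hot encoding). The sketch below uses only the three sample rows from the question; the function name `multi_hot` and the dictionary layout are assumptions made for illustration, not part of the original post.

```python
# The three sample rows from the question, as plain dictionaries.
rows = [
    {"patient_id": 1, "medicine": "A,B,C,D,E", "death": 1},
    {"patient_id": 2, "medicine": "B,D", "death": 0},
    {"patient_id": 3, "medicine": "A,D,E", "death": 1},
]

def multi_hot(rows, column, sep=","):
    """Turn a delimited string column into one 0/1 indicator per distinct item."""
    vocab = sorted({item.strip() for row in rows for item in row[column].split(sep)})
    encoded = []
    for row in rows:
        present = {item.strip() for item in row[column].split(sep)}
        encoded.append([1 if item in present else 0 for item in vocab])
    return vocab, encoded

vocab, X = multi_hot(rows, "medicine")   # one indicator column per medicine
y = [row["death"] for row in rows]       # target variable
```

Each row of `X` can then be passed to a classifier as a fixed-length feature vector, with `y` as the label.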