What to do with missing data in R
[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers].
Handling missing values is one of the common tasks in data analysis.
In R, missing values are often represented by the symbol NA (not available) or by some other value chosen to stand for missing data (e.g. 99).
Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a number).
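The difference between NA and NaN is easy to check at the console. The short sketch below uses only base R; the vector name v is made up for illustration.

```r
# 0/0 is undefined and gives NaN, whereas 1/0 gives Inf rather than NaN
v <- c(1, NA, 0 / 0, 1 / 0)

na_flags  <- is.na(v)   # TRUE for both NA and NaN
nan_flags <- is.nan(v)  # TRUE only for NaN

na_flags
nan_flags
```

Because is.na() also flags NaN, the NA-handling commands in this post treat both kinds of value as missing.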
Handling missing values in R
You can test for missing values with the command below.
y <- c(1, 2, 3, NA)
is.na(y) # returns a logical vector (FALSE FALSE FALSE TRUE)
You can use this function on vectors as well as data frames.
To find the location of NAs in a vector, you can use the which command.
which(is.na(y))
For a data frame, the sum function is handy.
sum(is.na(df))
Here df is the data frame.
The function returns the total number of NA values.
For data frames with multiple columns, a convenient shortcut is colSums, which gives the count per column.
colSums(is.na(df))
The summary function can also be used to find missing values in data frames.
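To see these counting commands side by side, here is a small sketch on a made-up data frame (the column names and values are invented for illustration).

```r
# Toy data frame, invented for illustration
df <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 7, 8),
  v3 = c("a", "b", NA, "d")
)

total_na  <- sum(is.na(df))       # total number of NA values in df
na_by_col <- colSums(is.na(df))   # NA count for each column
na_pos    <- which(is.na(df$v1))  # positions of NA within one column

total_na
na_by_col
na_pos
```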
If you are calculating an average, the na.rm argument is very useful. For example, to replace the missing values in x with the mean of the observed values:
x[is.na(x)] <- mean(x, na.rm = TRUE)
Sometimes missing values are stored as 99; you can convert these into NA using the following command.
df$v1[df$v1==99] <- NA
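When several columns use the same placeholder, the recoding can be done in one step. The sketch below assumes a made-up data frame in which 99 means "missing" in v1 and v2 but is a legitimate value in id, so only the chosen columns are touched.

```r
# Toy data, invented for illustration
df <- data.frame(v1 = c(5, 99, 7), v2 = c(99, 2, 99), id = c(99, 100, 101))

# Columns where 99 is a placeholder for "missing" (an assumption about this toy data)
cols <- c("v1", "v2")

# Replace 99 with NA in those columns only; id is left untouched
df[cols][df[cols] == 99] <- NA
df
```

Restricting the recode to named columns avoids wiping out real values that happen to equal the placeholder.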
One of the common methods is na.omit.
The function na.omit() returns the object with listwise deletion of missing values.
This function creates a new dataset without missing data.
newdata <- na.omit(df)
Another traditional way of handling missing values is based on complete.cases.
The function complete.cases() returns a logical vector indicating which cases are complete.
This will list the rows of data that have missing values.
df[!complete.cases(df),]
Subset with complete.cases to get the complete cases.
df[complete.cases(df), ]
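The two deletion approaches keep the same rows, which a quick sketch on a made-up data frame can confirm. It also shows how many rows are lost, which matters for the discussion that follows.

```r
# Toy data, invented for illustration
df <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

kept_omit <- na.omit(df)               # listwise deletion
kept_cc   <- df[complete.cases(df), ]  # same rows via a logical index

rows_dropped <- nrow(df) - nrow(kept_omit)
rows_dropped

# na.omit() additionally records which rows it removed
attr(kept_omit, "na.action")
```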
Deleting rows that contain missing values reduces the sample size and also discards some good, representative values.
Deleting missing data leads to the following major issues
- Data loss
- Bias
The methods mentioned above are not suitable in some cases. A commonly used alternative is to replace NA values with the average, but in some cases this will be erroneous.
Suppose we have vehicle failure data with the following details
Vehicle | Month | Distance Travelled |
1 | 35 | 20000 |
2 | 24 | 18000 |
3 | 1 | NA |
4 | 12 | 6000 |
In the above data, replacing the NA with 14667 (i.e. the average of the observed values) would be erroneous, because a vehicle that is only 1 month old is unlikely to have travelled that distance.
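The arithmetic behind this example can be reproduced in a few lines. The data frame below simply retypes the small table; the column names are chosen here for illustration.

```r
# The four vehicles from the table above
vehicles <- data.frame(
  vehicle  = 1:4,
  month    = c(35, 24, 1, 12),
  distance = c(20000, 18000, NA, 6000)
)

# Mean of the three observed distances
avg <- mean(vehicles$distance, na.rm = TRUE)
round(avg)

# Distance per month of age for the observed vehicles
per_month <- vehicles$distance / vehicles$month
round(per_month)
```

The observed vehicles cover a few hundred units of distance per month, so giving the one-month-old vehicle the overall mean would imply far more driving in a single month than any of the others managed.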
Now we are going to discuss how to deal with this kind of issue smartly!
Load Libraries
library(mice)
library(VIM)
Getting Data
data <- read.csv("D:/RStudio/MissingDataAnalysis/VehicleData.csv", header = TRUE)
str(data)
'data.frame': 1624 obs. of 7 variables:
 $ vehicle: int 1 2 3 4 5 6 7 8 9 10 ...
 $ fm     : int 0 10 15 0 13 21 11 5 8 1 ...
 $ Mileage: int 863 4644 16330 13 22537 40931 34762 11051 7003 11 ...
 $ lh     : num 1.1 2.4 4.2 1 4.5 3.1 0.7 2.9 3.4 0.7 ...
 $ lc     : num 66.3 233 325.1 66.6 328.7 ...
 $ mc     : num 697 120 175 0 175 ...
 $ State  : chr "MS" "CA" "WI" "OR" ...
This dataset contains 1624 observations and 7 variables. You can access the data set from here.
summary(data)
    vehicle             fm            Mileage            lh               lc               mc             State
 Min.   :   1.0   Min.   :-1.000   Min.   :    1   Min.   : 0.000   Min.   :   0.0   Min.   :   0.0   Length:1624
 1st Qu.: 406.8   1st Qu.: 4.000   1st Qu.: 5778   1st Qu.: 1.500   1st Qu.: 106.5   1st Qu.: 119.7   Class :character
 Median : 812.5   Median :10.000   Median :17000   Median : 2.600   Median : 195.4   Median : 119.7   Mode  :character
 Mean   : 812.5   Mean   : 9.414   Mean   :20559   Mean   : 3.294   Mean   : 242.8   Mean   : 179.4
 3rd Qu.:1218.2   3rd Qu.:14.000   3rd Qu.:30061   3rd Qu.: 4.300   3rd Qu.: 317.8   3rd Qu.: 175.5
 Max.   :1624.0   Max.   :23.000   Max.   :99983   Max.   :35.200   Max.   :3234.4   Max.   :3891.1
                                   NA's   :13      NA's   :6        NA's   :8
As mentioned earlier, the summary function shows which columns contain missing values. Note that State is a character column, so summary does not report an NA count for it.
Missing data
Now we need to calculate what percentage of the data is missing from each variable.
p <- function(x) {sum(is.na(x))/length(x)*100}
apply(data, 2, p)
  vehicle        fm   Mileage        lh        lc        mc     State
0.0000000 0.0000000 0.8004926 0.3694581 0.4926108 0.0000000 0.9236453
vehicle, fm and mc contain no missing values; lh has 0.37%, lc has 0.49%, Mileage has 0.80%, and the State column has the most missing values at 0.92%.
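The same percentages can be obtained without writing a helper function, because the column means of the logical matrix from is.na() are the missing proportions. A minimal sketch on made-up data (not the vehicle data set):

```r
# Toy data, invented for illustration
df <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, "x", "y"),
  v3 = 1:4
)

# Proportion of NA per column, expressed as a percentage
pct_missing <- colMeans(is.na(df)) * 100
pct_missing
```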
md.pattern(data)
     vehicle fm mc lh lc Mileage State
1586       1  1  1  1  1       1     1  0
11         1  1  1  1  1       1     0  1
13         1  1  1  1  1       0     1  1
6          1  1  1  1  0       1     1  1
2          1  1  1  1  0       1     0  2
4          1  1  1  0  1       1     1  1
2          1  1  1  0  1       1     0  2
           0  0  0  6  8      13    15 42
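A 0 indicates a missing value and a 1 an observed value. The first column gives the number of rows with each pattern, the last column counts the missing variables in that pattern, and the bottom row gives the number of missing values per variable (42 in total). For example, 11 rows are missing only the State value.
Similarly, 13 rows are missing only the Mileage value.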
md.pairs(data)
$rr
        vehicle   fm Mileage   lh   lc   mc State
vehicle    1624 1624    1611 1618 1616 1624  1609
fm         1624 1624    1611 1618 1616 1624  1609
Mileage    1611 1611    1611 1605 1603 1611  1596
lh         1618 1618    1605 1618 1610 1618  1605
lc         1616 1616    1603 1610 1616 1616  1603
mc         1624 1624    1611 1618 1616 1624  1609
State      1609 1609    1596 1605 1603 1609  1609
$rm
        vehicle fm Mileage lh lc mc State
vehicle       0  0      13  6  8  0    15
fm            0  0      13  6  8  0    15
Mileage       0  0       0  6  8  0    15
lh            0  0      13  0  8  0    13
lc            0  0      13  6  0  0    13
mc            0  0      13  6  8  0    15
State         0  0      13  4  6  0     0
$mr
        vehicle fm Mileage lh lc mc State
vehicle       0  0       0  0  0  0     0
fm            0  0       0  0  0  0     0
Mileage      13 13       0 13 13 13    13
lh            6  6       6  0  6  6     4
lc            8  8       8  8  0  8     6
mc            0  0       0  0  0  0     0
State        15 15      15 13 13 15     0
$mm
        vehicle fm Mileage lh lc mc State
vehicle       0  0       0  0  0  0     0
fm            0  0       0  0  0  0     0
Mileage       0  0      13  0  0  0     0
lh            0  0       0  6  0  0     2
lc            0  0       0  0  8  0     2
mc            0  0       0  0  0  0     0
State         0  0       0  2  2  0    15
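rr counts the cases where both variables are observed
rm counts the cases where the row variable is observed and the column variable is missing
mr counts the cases where the row variable is missing and the column variable is observed
mm counts the cases where both variables are missing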
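These four matrices are just cross-products of the "observed" and "missing" indicator matrices. The base R sketch below computes them by hand on a made-up two-column data frame; the names rr, r_m, m_r and mm are chosen here for illustration (r_m and m_r avoid masking the base function rm).

```r
# Toy data, invented for illustration
df <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, NA))

obs <- !is.na(df)   # TRUE where a value is observed
mis <- is.na(df)    # TRUE where a value is missing

rr  <- t(obs) %*% obs  # both observed
r_m <- t(obs) %*% mis  # row variable observed, column variable missing
m_r <- t(mis) %*% obs  # row variable missing, column variable observed
mm  <- t(mis) %*% mis  # both missing

rr
r_m
m_r
mm
```

For every pair of variables the four counts add up to the number of rows, which is a handy sanity check.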
marginplot(information[,c('Mileage', 'lc')])
Blue values are observed values and red ones are missing values.
Impute
impute <- mice(data[,2:7], m=3, seed = 123)
print(impute)
Class: mids
Number of multiple imputations:  3
Imputation methods:
     fm Mileage      lh      lc      mc   State
     ""   "pmm"   "pmm"   "pmm"      ""      ""
PredictorMatrix:
        fm Mileage lh lc mc State
fm       0       1  1  1  1     0
Mileage  1       0  1  1  1     0
lh       1       1  0  1  1     0
lc       1       1  1  0  1     0
mc       1       1  1  1  0     0
State    0       0  0  0  0     0
Number of logged events:  1
  it im dep     meth   out
1  0  0     constant State
Here you can see the imputation method used for each variable.
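For example, the variable fm contains no missing values and hence no method is applied.
For the variables Mileage, lh and lc the "pmm" method is used.
pmm stands for predictive mean matching.
polyreg is used for unordered factor variables with more than two levels; polyreg stands for polytomous (multinomial) logistic regression.
State also has missing values, but it was read as a character column; the logged event in the output shows that mice set it aside rather than imputing it.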
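To give a feel for what predictive mean matching does, here is a deliberately simplified base R sketch on simulated data. It is not the mice implementation, which is more elaborate; the variable names, the simulated data and the choice of k = 5 donors are all assumptions made for illustration.

```r
set.seed(1)

# Simulated data: y depends on x, and three y values are made missing
n <- 50
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n)
y[c(5, 17, 33)] <- NA

obs <- !is.na(y)

# Step 1: regress y on x using the observed cases only
fit <- lm(y ~ x, subset = obs)

# Step 2: predicted means for every case, observed or missing
pred <- predict(fit, newdata = data.frame(x = x))

# Step 3: for each missing case, find the k observed cases with the
# closest predicted mean and borrow the observed y of one of them
k <- 5
y_imp <- y
for (i in which(!obs)) {
  donors <- order(abs(pred[obs] - pred[i]))[1:k]
  y_imp[i] <- sample(y[obs][donors], 1)
}

y_imp[!obs]
```

Because every imputed value is copied from a real observation, the filled-in values always stay within the range of the observed data, which is one reason this method behaves better than mean substitution in cases like the one-month-old vehicle.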
impute$imp$Mileage
         1     2     3
19   40558 25481 21179
20   25478 13785 16319
253    138  1945   251
254  13089 16963 31078
255  26785 47232 22101
256  34822 33543 19486
861   6136 17212 75106
862   2844   243  1591
863  16319 19539 17610
1568 29262 17948 27095
1569 15299 11912  3253
1570    94   277    31
1571  3296 11000 17217
Three imputations were calculated here for illustration; the imputation best suited to your dataset can be used for further analysis. For example, look at the 20th row.
data[20,]
   vehicle fm Mileage  lh    lc   mc State
20      20  8      NA 1.4 87.42 1.85    NH
Mileage is missing in the 20th row. The first imputation gives 25478, the second 13785, and the third 16319. Because these values are borrowed from vehicles with similar characteristics, they are generally more plausible than a single overall average.
Complete data
newDATA <- complete(impute, 1)
Distribution of observed/imputed values
xyplot(impute, lc ~ lh | .imp, pch = 20, cex = 1.4)
The first panel shows the original observations, followed by imputations 1, 2, and 3. You can see that the overall pattern does not change after imputing the observations.
Conclusion
With the mice package, missing values can be handled smartly: understand your data sets and apply the right algorithms.
If you are using any other methods or functions, please mention them in the comments section.
The post Handling missing values in R appeared first on finnstats.
Source: https://www.r-bloggers.com/2021/04/handling-missing-values-in-r/