What to do with missing data in R

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers].


Handling missing values is one of the most common tasks in data analysis.

In R, missing values are usually represented by the symbol NA (not available), or sometimes by a placeholder value chosen to stand for missing data (e.g. 99).

Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a number).
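As a small illustration of the difference between the two symbols, the sketch below builds a hypothetical vector v and tests it with base R functions. Note that is.na() flags both NA and NaN, while is.nan() flags only NaN.

```r
# NA marks a missing value; NaN marks an undefined numeric result.
# v is a made-up vector used only for illustration.
v <- c(1, NA, 0 / 0, 1 / 0)

na_flags     <- is.na(v)      # TRUE for NA and for NaN
nan_flags    <- is.nan(v)     # TRUE only for NaN
finite_flags <- is.finite(v)  # FALSE for NA, NaN and Inf
```

Dividing a non-zero number by zero gives Inf rather than NaN, which is why is.finite() is a useful extra check.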

Handling missing values in R

You can test for missing values with the command below.

y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

You can use this function on vectors as well as on data frames.

To identify the location of NAs in a vector, you can use the which command.

which(is.na(y))

In the case of a data frame, the sum function is handy.

sum(is.na(df))

Here df indicates the data frame.

The function will return the total number of NA values.

In the case of data frames with multiple columns, a convenient shortcut is colSums.

colSums(is.na(df))

The summary function can also be used for finding missing values in data frames.
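The counting commands above can be tried on a toy data frame. This is only a sketch: the data frame and its values are made up, and rowSums is added as an assumption about what a reader might also want (the count per row).

```r
# Toy data frame (hypothetical values) with a few missing entries
df <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c("a", "b", NA, NA),
  v3 = c(10, 20, 30, 40)
)

total_na  <- sum(is.na(df))      # total number of NAs in the data frame
na_by_col <- colSums(is.na(df))  # number of NAs in each column
na_by_row <- rowSums(is.na(df))  # number of NAs in each row
```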

If you are calculating the average value, na.rm is very useful. The command below replaces every NA in x with the mean of the observed values.

x[is.na(x)] <- mean(x, na.rm = TRUE)
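A minimal worked example of this replacement, using a made-up vector x:

```r
x <- c(2, 4, NA, 6)               # hypothetical values

mean_with_na <- mean(x)           # NA, because one element is missing
mean_obs <- mean(x, na.rm = TRUE) # 4, the mean of 2, 4 and 6

x[is.na(x)] <- mean_obs           # fill the gap with the observed mean
x                                 # 2 4 4 6
```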

Sometimes missing values are stored as 99; you can convert them into NA using the following command.

df$v1[df$v1==99] <- NA
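If the data come from a CSV file, the same conversion can also be done while reading, through the na.strings argument of read.csv. The sketch below uses a made-up inline CSV instead of a file. Note that na.strings applies to every column, so it is only appropriate when 99 never occurs as a genuine value.

```r
# Hypothetical CSV content; in practice this would be a file path
csv_text <- "id,v1\n1,12\n2,99\n3,7\n"

# Treat both "NA" and "99" as missing while reading
df <- read.csv(text = csv_text, na.strings = c("NA", "99"))

df$v1  # 12 NA 7
```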

One of the common methods is na.omit.

The function na.omit() returns the object with listwise deletion of missing values.

This function creates a new dataset without missing data.

newdata <- na.omit(df)

Another traditional way of handling missing values is based on complete.cases.

The function complete.cases() returns a logical vector indicating which cases are complete.

This will list the rows of data that have missing values.

df[!complete.cases(df),]

Subset with complete.cases to get the complete cases.

df[complete.cases(df), ]
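To see that both approaches keep the same rows, here is a small sketch on a made-up data frame. It also counts how many rows are lost, which is worth checking before deleting anything.

```r
# Hypothetical data frame with missing values in two different rows
df <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

keep      <- complete.cases(df)  # TRUE where the whole row is observed
by_subset <- df[keep, ]          # complete rows via subsetting
by_omit   <- na.omit(df)         # complete rows via listwise deletion

rows_dropped <- nrow(df) - nrow(by_omit)
```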

Deleting rows that contain missing values reduces the sample size and also discards some good, representative values.

Deleting missing data leads to the following major issues:

  • Data loss
  • Biased data

The methods mentioned above are not suitable in some cases. One commonly used alternative is to replace NA values with the average, but in some cases this will be erroneous.

Imagine we have vehicle failure data with the following details:

Vehicle   Month   Distance Travelled
1         35      20000
2         24      18000
3         1       NA
4         12      6000

In the above data, replacing the NA value with 14666 (i.e. the average value) would be erroneous, because a one-month-old vehicle is unlikely to have traveled that distance.
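The arithmetic can be checked with a short sketch. The data frame below reproduces the four rows of the table. The age-based guess at the end is only an illustrative assumption (average distance per month times the vehicle's age) and is not the method used later in this post.

```r
vehicles <- data.frame(
  Vehicle  = 1:4,
  Month    = c(35, 24, 1, 12),
  Distance = c(20000, 18000, NA, 6000)
)

# Plain mean of the observed distances: 44000 / 3, about 14666.67
avg <- mean(vehicles$Distance, na.rm = TRUE)

# Illustrative alternative that uses the vehicle's age:
# observed distance per month, scaled by the age of the incomplete row
obs         <- !is.na(vehicles$Distance)
per_month   <- sum(vehicles$Distance[obs]) / sum(vehicles$Month[obs])
ratio_guess <- per_month * vehicles$Month[!obs]
```

The two guesses differ by more than an order of magnitude, which shows why an imputation should take the other variables into account.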

Now we are going to discuss how to deal with this kind of issue smartly!

Load Libraries

library(mice)
library(VIM)

Getting Data

data <- read.csv("D:/RStudio/MissingDataAnalysis/VehicleData.csv", header = T)
str(data)
'data.frame': 1624 obs. of  7 variables:
 $ vehicle: int  1 2 3 4 5 6 7 8 9 10 ...
 $ fm     : int  0 10 15 0 13 21 11 5 8 1 ...
 $ Mileage: int  863 4644 16330 13 22537 40931 34762 11051 7003 11 ...
 $ lh     : num  1.1 2.4 4.2 1 4.5 3.1 0.7 2.9 3.4 0.7 ...
 $ lc     : num  66.3 233 325.1 66.6 328.7 ...
 $ mc     : num  697 120 175 0 175 ...
 $ State  : chr  "MS" "CA" "WI" "OR" ...

This dataset contains 1624 observations and 7 variables. You can access the data set from here.

summary(data)
    vehicle             fm            Mileage            lh               lc               mc            State
 Min.   :   1.0   Min.   :-1.000   Min.   :    1   Min.   : 0.000   Min.   :   0.0   Min.   :   0.0   Length:1624
 1st Qu.: 406.8   1st Qu.: 4.000   1st Qu.: 5778   1st Qu.: 1.500   1st Qu.: 106.5   1st Qu.: 119.7   Class :character
 Median : 812.5   Median :10.000   Median :17000   Median : 2.600   Median : 195.4   Median : 119.7   Mode  :character
 Mean   : 812.5   Mean   : 9.414   Mean   :20559   Mean   : 3.294   Mean   : 242.8   Mean   : 179.4
 3rd Qu.:1218.2   3rd Qu.:14.000   3rd Qu.:30061   3rd Qu.: 4.300   3rd Qu.: 317.8   3rd Qu.: 175.5
 Max.   :1624.0   Max.   :23.000   Max.   :99983   Max.   :35.200   Max.   :3234.4   Max.   :3891.1
                                   NA's   :13      NA's   :6        NA's   :8

As mentioned earlier, the summary function shows which columns contain missing values.

Missing data

Now we need to calculate what percentage of data is missing from each variable.

p <- function(x) {sum(is.na(x))/length(x)*100}
apply(data, 2, p)
  vehicle        fm   Mileage        lh        lc        mc     State
0.0000000 0.0000000 0.8004926 0.3694581 0.4926108 0.0000000 0.9236453

The vehicle, fm and mc columns contain no missing values; lh has 0.37% missing, lc has 0.49%, Mileage has 0.80%, and the State column has the most, with 0.92%.
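The same percentages can be obtained without a helper function, since the mean of a logical vector is the share of TRUE values. The sketch below uses a made-up data frame rather than the vehicle data. It also avoids apply(), which first converts the whole data frame to a matrix.

```r
# Hypothetical data frame with one numeric and one character column
df <- data.frame(num = c(1, NA, 3, 4), chr = c("a", NA, NA, "d"))

# Share of missing values per column, expressed as a percentage
pct_missing <- colMeans(is.na(df)) * 100

round(pct_missing, 2)
```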

md.pattern(data)
     vehicle fm mc lh lc Mileage State
1586       1  1  1  1  1       1     1  0
11         1  1  1  1  1       1     0  1
13         1  1  1  1  1       0     1  1
6          1  1  1  1  0       1     1  1
2          1  1  1  1  0       1     0  2
4          1  1  1  0  1       1     1  1
2          1  1  1  0  1       1     0  2
           0  0  0  6  8      13    15 42

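A zero indicates a missing value. The first column gives the number of rows with each pattern and the last column gives the number of missing variables in that pattern. For example, 11 rows have exactly one missing value, in the State column.

Similarly, there are 13 rows in which only the Mileage value is missing.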

md.pairs(data)
$rr
        vehicle   fm Mileage   lh   lc   mc State
vehicle    1624 1624    1611 1618 1616 1624  1609
fm         1624 1624    1611 1618 1616 1624  1609
Mileage    1611 1611    1611 1605 1603 1611  1596
lh         1618 1618    1605 1618 1610 1618  1605
lc         1616 1616    1603 1610 1616 1616  1603
mc         1624 1624    1611 1618 1616 1624  1609
State      1609 1609    1596 1605 1603 1609  1609
$rm
        vehicle fm Mileage lh lc mc State
vehicle       0  0      13  6  8  0    15
fm            0  0      13  6  8  0    15
Mileage       0  0       0  6  8  0    15
lh            0  0      13  0  8  0    13
lc            0  0      13  6  0  0    13
mc            0  0      13  6  8  0    15
State         0  0      13  4  6  0     0
$mr
        vehicle fm Mileage lh lc mc State
vehicle       0  0       0  0  0  0     0
fm            0  0       0  0  0  0     0
Mileage      13 13       0 13 13 13    13
lh            6  6       6  0  6  6     4
lc            8  8       8  8  0  8     6
mc            0  0       0  0  0  0     0
State        15 15      15 13 13 15     0
$mm
        vehicle fm Mileage lh lc mc State
vehicle       0  0       0  0  0  0     0
fm            0  0       0  0  0  0     0
Mileage       0  0      13  0  0  0     0
lh            0  0       0  6  0  0     2
lc            0  0       0  0  8  0     2
mc            0  0       0  0  0  0     0
State         0  0       0  2  2  0    15

rr indicates how many data points are observed in both variables of a pair

rm indicates observed (row variable) versus missing (column variable)

mr indicates missing (row variable) versus observed (column variable)

mm indicates missing versus missing
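The four matrices are simply pairwise counts, which can be reproduced in base R with cross-products of the observed and missing indicators. The sketch below does this for a made-up two-column data frame. The list name pair_counts is an arbitrary choice for illustration.

```r
# Hypothetical data: a is missing in row 2, b is missing in rows 1 and 2
df <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 6, 7))

r <- 1 * !is.na(df)  # 1 where a value is observed, 0 where missing
m <- 1 * is.na(df)   # 1 where a value is missing, 0 where observed

pair_counts <- list(
  rr = crossprod(r, r),  # both variables observed
  rm = crossprod(r, m),  # row variable observed, column variable missing
  mr = crossprod(m, r),  # row variable missing, column variable observed
  mm = crossprod(m, m)   # both variables missing
)
```

For every pair of variables the four counts add up to the number of rows, which is a quick sanity check on the output.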

marginplot(data[,c('Mileage', 'lc')])

Blue values are observed values and red ones are missing values.

Impute

impute <- mice(data[,2:7], m=3, seed = 123)
print(impute)
Class: mids
Number of multiple imputations:  3
Imputation methods:
     fm Mileage      lh      lc      mc   State
     ""   "pmm"   "pmm"   "pmm"      ""      ""
PredictorMatrix:
        fm Mileage lh lc mc State
fm       0       1  1  1  1     0
Mileage  1       0  1  1  1     0
lh       1       1  0  1  1     0
lc       1       1  1  0  1     0
mc       1       1  1  1  0     0
State    0       0  0  0  0     0
Number of logged events:  1
  it im dep     meth   out
1  0  0     constant State

Here you can see the method used for imputing each variable.

For example, the variable fm contains no missing values and hence no method is applied.

For the variables Mileage, lh and lc, the "pmm" method is used.

pmm stands for predictive mean matching.

polyreg is used for unordered factor variables with more than two levels; it stands for polytomous (multinomial) logistic regression.
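To give a feel for what predictive mean matching does, here is a deliberately simplified base R sketch on made-up data. It is not the mice implementation: mice additionally draws the regression coefficients at random to reflect their uncertainty, which is left out here. The variable names and the choice of k = 3 donors are assumptions for illustration.

```r
set.seed(123)

# Hypothetical toy data: y depends on x, and three y values are missing
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0, NA, 20.2)

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)                     # regression on observed cases
pred <- predict(fit, newdata = data.frame(x = x))   # predicted mean for every case

k <- 3                                              # number of candidate donors
y_imp <- y
for (i in which(!obs)) {
  # Donors are the observed cases whose predicted value is closest to case i
  d      <- abs(pred[obs] - pred[i])
  donors <- which(obs)[order(d)[1:k]]
  # Copy the real observed value of one randomly chosen donor
  y_imp[i] <- y[donors[sample.int(k, 1)]]
}
```

Because each imputed value is copied from a real observation, the imputations always stay within the range of values that actually occur in the data.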

impute$imp$Mileage
         1     2     3
19   40558 25481 21179
20   25478 13785 16319
253    138  1945   251
254  13089 16963 31078
255  26785 47232 22101
256  34822 33543 19486
861   6136 17212 75106
862   2844   243  1591
863  16319 19539 17610
1568 29262 17948 27095
1569 15299 11912  3253
1570    94   277    31
1571  3296 11000 17217

In total, 3 imputations are calculated here for illustration; the imputation best suited to your dataset can be used for further analysis. Just look at the 20th row.

data[20,]
   vehicle fm Mileage  lh    lc   mc State
20      20  8      NA 1.4 87.42 1.85    NH

In the 20th row Mileage is missing. The first imputation is 25478, the second is 13785, and the third is 16319. These values draw on the other variables in the row, so they are usually more plausible than the overall average.

Complete data

newDATA <- complete(impute, 1)

Distribution of observed/imputed values

xyplot(impute, lc ~ lh | .imp, pch = 20, cex=1.4)

The first panel shows the original observations, followed by imputations 1, 2, and 3. You can see that the overall pattern does not change after imputing the observations.

Conclusion

With the mice package, missing values can be handled smartly. Understand your data sets and apply the correct algorithms.

If you are using any other methods or functions, please mention them in the comments section.

The post Handling missing values in R appeared first on finnstats.

Source: https://www.r-bloggers.com/2021/04/handling-missing-values-in-r/
