final_project.Rmd.

  • HW: Final Project
  • Fall 2021, DSPA (HS650)
  • Name: Kevin Wu
  • SID: ####0012 (last 4 digits only)
  • UMich E-mail:
  • I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
  • Remember that students are allowed and encouraged to discuss, on a conceptual level, the problems with your class mates, however, this can not involve the exchange of actual code, printouts, solutions, e-mails or other explicit electronic or paper handouts.

1 Final Project Introduction

The Panel Study of Income Dynamics (PSID) began in 1968, studying over 18,000 individuals living in 5,000 families in the US, and is the longest running longitudinal household survey in the world (Panel Study of Income Dynamics, n.d.). Various data such as employment, income, health, marriage, childbearing, philanthropy, education, etc. have been collected. In 1997, the PSID launched the Child Development Supplement (CDS) I-III studying the children of these families. The Transition into Adulthood Supplement (TAS) began in 2005 which collected data from these very children transitioning into young adults. The following data is a custom dataset composed of the CDSII (children interviewed in 2001) and follows these children in the TAS 2011.

The following dataset I am analyzing is a filtered version of a custom dataset Dr. Ashley Palmer compiled from a study done by Dr. Corey Keyes in 2006 using the CDSII data.

In the dataset, they determined who was languishing, stable, and flourishing based upon answers to questions directly asking about their mental health. Because of its longitudinal nature, they also identified positive change states. I’ve decided the outcome variable of my study is where “change” = 2 (desirable move/state) vs 0 (steady but not ideal move/state) & 1 (undesirable move/state).

The questions I’m looking to answer are
* What are the best predictors? * Which model has the best performance?

These are the methods of approaches I have considered to analyze the data
1. Use feature selection models (Boruta and Stepwise Feature Selection) given the nature of my study (link).
2. Use clustering nearest neighbors (kNN, k-Means) using one-hot encoding because via clustering I can determine the trends within the clusters that may be predictive in nature (link).
3. Apriori techniques to understand the possible relationships between the features (link).
4. Decision Trees to determine the model that performs the best (link).

2 Dataset Details

Some features were removed from Dr. Palmer’s custom dataset because they were derived from the original CDSII and TAS2011 questions that are used to create the “change” features - the features that I am studying.

In the effort to factorialize “mhstatus” and “change” I had to either remove the missing values, or account for them. Since there is no way to determine whether a missing value indicates a languishing status (perhaps due to the incomplete nature of the survey) or an incomplete status, I chose to remove them as my main focus is what features aid in positive changes. Sadly, this removed 30% or 835 participant surveys.

However, upon removing the missing data, I have removed the data being skewed toward stable but un-ideal states and states of decline. For this reason I did not need to do any sort of minority boosting or downsampling the majority sample via SMOTE.

Some data had some NA and I will determine the appropriateness of imputing the missing values. Simply removing all observations with NA as a feature leaves us with no data to analyze. So the question is whether imputation or substituting NA’s with a dummy value is appropriate.

I also created a separate dataset containing only objective data, as through my modeling, I noticed a lot of the subjective indicators were strongly influencing the outcomes. I wanted to see what was contributing to these subjective indicators, so I created a dataset of only “objective” features.

Header File: Defining Datasets

#load and convert Stata data into R dataframe frames into environment

dim(df_full)
## [1] 1966  267
dim(df_objective)
## [1] 1966  207
# display stata datatypes to readible strings
convertStatDataType <- function(strType) {
  mapping <- c("%8.0g"="byte", "%8.0g"="int", "%12.0g"="long", "%9.0g"="float", "%10.0g"="double", "%#s"="str#", "%9s"="strL")
  mapping[strType]
}

# extract the tiblr dataset inforamtion and display them
convertDataframeDisplay <- function(df_convert) {
  dfDisplay <- data.frame(matrix(ncol = 0, nrow = ncol(df_convert)))
  vectName <- vector(length = ncol(df_convert))
  vectLabel <- vector(length = ncol(df_convert))
  vectFormat <- vector(length = ncol(df_convert))
  vectDescriptions <- vector(length = ncol(df_convert))
  
  for(i in 1:ncol(df_convert)) {
    vectName[i] <- names(df_convert[, i])
    if(is.null(attributes(df_convert[[i]])$label) == F)
      vectLabel[i] <- attributes(df_convert[[i]])$label
    if(is.null(attributes(df_convert[[i]])$format.stata) == F)
      vectFormat[i] <- lapply(attributes(df_convert[[i]])$format.stata, convertStatDataType)
    if(is.null(attributes(df_convert[[i]])$class) == F && attributes(df_convert[[i]])$class == "factor") {
      vectFormat[i] <- "factor"
      vectDescriptions[i] <- str_flatten(attributes(df_convert[[i]])$levels, "<br />")
    }
    if(is.null(attributes(df_convert[[i]])$class) == F && attributes(df_convert[[i]])$class != "factor" && attributes(df_convert[[i]])$class[3] == "double")
      vectFormat[i] <- "double"
    if(is.null(attributes(df_convert[[i]])$labels) == F && length(attributes(df_convert[[i]])$labels) > 0) {
      vectLabels <- list(length(attributes(df_convert[[i]])$labels))
      for(j in 1:length(attributes(df_convert[[i]])$labels)) {
        vectLabels[j] <- as.character(attributes(df_convert[[i]])$labels[j])
        if(is.null(names(attributes(df_convert[[i]])$labels[j])) == F && as.character(names(attributes(df_convert[[i]])$labels[j])) != "Actual number")
          vectLabels[j] <- paste(c(vectLabels[j], ":", as.character(names(attributes(df_convert[[i]])$labels[j]))), collapse = " ")
        else if(as.character(names(attributes(df_convert[[i]])$labels[j])) == "Actual number")
          vectLabels[j] <- paste0(vectLabels[j], ":::")
      }
      vectDescriptions[i] <- str_flatten(vectLabels, "<br />")
      vectDescriptions[i] <- str_replace_all(vectDescriptions[i], ":::<br />", ", ")
    }
  }
  dfDisplay$ColumnName <- vectName
  dfDisplay$Label <- vectLabel
  dfDisplay$Format<- vectFormat
  dfDisplay$Labels <- vectDescriptions
  return(dfDisplay)
}

2.1 Filtered Dataframe aka Full Dataset

df_display <- convertDataframeDisplay(df_full_display)
datatable(df_display, options = list(
  pageLength=10,
  lengthMenu=c(10,50,100,150,250,300)
  ),
  escape=F
)

2.2 Filtered Dataframe with Only Objective Data aka Objective Dataset

df_objective_display <- convertDataframeDisplay(df_objective_display)
datatable(df_objective_display, options = list(
  pageLength=10,
  lengthMenu=c(10,50,100,150,250,300)
  ),
  escape=F
)

3 Data Distribution

plot_ly(x = ~mhstatus, type="histogram") %>%
  layout(title = "Distribution of Mental Health Status", xaxis = list(title = "Mental Health Status"), bargap=0.1,
         legend = list(orientation = 'h', title = list(text = "<b>Mental Health</b>")))

3.1 Comparison Between Training and Testing Data

#full dataset
prop.table(table(df_full$PositiveChange))
## 
##      Positive Change  Not Positive Change 
##            0.5320448            0.4679552
prop.table(table(df_full_train$PositiveChange))
## 
##      Positive Change  Not Positive Change 
##            0.5381679            0.4618321
prop.table(table(df_full_test$PositiveChange))
## 
##      Positive Change  Not Positive Change 
##            0.5076142            0.4923858
#objective dataset
prop.table(table(df_objective$PositiveChange))
## 
##      Positive Change  Not Positive Change 
##            0.5320448            0.4679552
prop.table(table(df_objective_train$PositiveChange))
## 
##      Positive Change  Not Positive Change 
##            0.5381679            0.4618321
prop.table(table(df_objective_test$PositiveChange))
## 
##      Positive Change  Not Positive Change 
##            0.5076142            0.4923858

3.2 To Impute or Not to Impute.

We will compare the mean and median for data that has been imputed vs data that have not been imputed. While the patterns appear similar, the y-axis for both the means and medians are different. Based upon the split violin plot, the ranges on the imputed data are wider and for nominal / categorical data, it does not seem appropriate to alter the integrity of such data. For this reason, I will not utilize imputation.

library(vioplot)

# display violin plot of left side being unimputed data vs right side that is imputed
plot_violin <- function(df, df_impute) {
  filter <- c("yaearnings_10", "Q23L30B")
  df <- df[ , !names(df) %in% filter]
  df_impute <- df_impute[ , !names(df_impute) %in% filter]
  df_unimputed_long <- gather(df, feature, measurement)
  feature_unimputed <- df_unimputed_long$feature
  feature_label_unimputed <- unlist(lapply(feature_unimputed, fieldname_to_description))
  measurement_unimputed <- df_unimputed_long$measurement
#  measurement_unimputed <- unlist(lapply(df_unimputed_long$measurement, function(x) { if(x == "-Inf") x = 0; return(x)}))
  df_unimputed_plotly <- data.frame(Feature = feature_unimputed, FeatureLabel = feature_label_unimputed, Measurement = measurement_unimputed)

  df_imputed_long <- gather(df_impute, feature, measurement)
  feature_imputed <- df_imputed_long$feature
  feature_label_imputed <- unlist(lapply(feature_imputed, fieldname_to_description))
  measurement_imputed <- df_imputed_long$measurement
#  measurement_imputed <- unlist(lapply(df_imputed_long$measurement, function(x) { if(x == "-Inf") x = 0; return(x)}))
  df_imputed_plotly <- data.frame(Feature = feature_imputed, FeatureLabel = feature_label_imputed, Measurement = measurement_imputed)

  p <- plot_ly(type="violin")
  p <- p %>% 
    add_trace(data = df_unimputed_plotly, x = ~FeatureLabel, y = ~Measurement, 
      legendgroup = "Unimputed", scalegroup = "Unimputed", name="Unimputed", side = "negative", 
      color = I("green"), box = list(visible = T), meanline = list(visible = T)
  )
  p <- p %>% 
    add_trace(data = df_imputed_plotly, x = ~FeatureLabel, y = ~Measurement, 
      legendgroup = "Imputed", scalegroup = "Imputed", name="Imputed", side = "postive", 
      color = I("orange"), box = list(visible = T), meanline = list(visible = T)
  )
  p <- p %>% layout(#zeroline = F, violingap = 0, violinmode = "overlay", violingroupgap = 0, 
    title = "Distribution of Unimputed vs Imputed data", 
    xaxis = list(title = "Fieldnames", range = c(12,16)), 
    yaxis = list(title = "Distribution", range = c(1, 5))
  )
  p
  return(p)
}

# plot the means and the medians of the fieldnames
plot_summary <- function(df, string_type) {
  sum_mat <- summary(df)
  vec_medians <- sum_mat[3, ]
  vec_means <- sum_mat[4, ]
  vec_fieldnames <- trimws(colnames(sum_mat))
  for(i in 1:length(vec_medians)) {
    vec_medians[i] <- as.numeric(str_replace(vec_medians[i], "Median :", ""))
    vec_means[i] <-  as.numeric(str_replace(vec_means[i], "Mean   :", ""))
  }
  df_new <- data.frame(
    medians = vec_medians,
    means = vec_means,
    fieldnames = vec_fieldnames
  )
  p <- plot_ly()
  p <- p %>% add_trace(data = df_new, x = ~fieldnames, y = ~means,  type = "bar", name="Means")
  p <- p %>% layout(title=paste0("Distribution of Means of Summary for ", string_type), yaxis = list(title = "Count", xaxis = list(title = "Field Names")))
  q <- plot_ly()
  q <- q %>% add_trace(data = df_new, x = ~fieldnames, y = ~medians, type = "bar", name="Medians")
  q <- q %>% layout(title=paste0("Distribution of Medians of Summary for ", string_type), yaxis = list(title = "Count", xaxis = list(title = "Field Names")))

  return(list(p, q))
}

set.seed(1234)
df_full_impute_forest <- as.data.frame(missForest::missForest(as.matrix(apply(df_full_na, 2, as.numeric)), maxiter=1)$ximp)
##   removed variable(s) 267 due to the missingness of all entries
##   missForest iteration 1 in progress...done!
plot_violin(df_full_na, df_full_impute_forest)
plotted_summary_imputed <- plot_summary(df_full_impute_forest, "Full Dataset - Random Forest")
plotted_summary_unimputed <- plot_summary(df_full_na, "Full Dataset - Non Imputed")

# Means
plotted_summary_imputed[[1]]
plotted_summary_unimputed[[1]]
# Medians
plotted_summary_imputed[[2]]
plotted_summary_unimputed[[2]]

4 Conclusion

Out of all of models I had tested, the C5.0 model using the full dataset performed the best. Unfortunately, I was unable to determine the best predictive features for this study. There were simply too many features in my clustering and aporiori selection techniques. I may want to try imputing the data to see if that improves the performance.

4.1 References

Panel Study of Income Dynamics. (n.d.). Institute of Social Research. Survey Research Center. https://psidonline.isr.umich.edu/