[ ] Decide which dataset to analyse
[ ] Set up a way to access the same version of the data so that the analysis can be reproduced.
[ ] Explore the data visually to understand the available features.
[ ] Formulate questions for Exploratory Data Analysis (EDA)
[ ] Evaluate possible directions for applying ML
[ ] Plan a Shiny app that presents the answers to the above interactively.
[ ] A self-hosted Shiny app needs Shiny Server to be set up on the VPS. This is the preferred setup.
[ ] Alternatively, the apps can be hosted for free on shinyapps.rstudio.
- This is a common and acceptable starting point, but the free service has several limitations, such as slow loading and caps on resource usage.
[ ] Perform EDA
[ ] Perform ML
- At least two approaches appear to make sense: linear regression (plus extensions such as glmnet) to predict the trend, and k-means clustering.
[ ] Review
[ ] Publish results / report.
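One way to approach the reproducibility item in the checklist above is to record a checksum of the downloaded file and verify it before each analysis run. The sketch below is in Python for illustration only (the plan itself targets R and Shiny). The function names `sha256_of_file` and `verify_dataset` are hypothetical, and the approach assumes the dataset is kept as a local file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so a large CSV never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_digest):
    """Return True if the file exists and matches the recorded digest."""
    return Path(path).is_file() and sha256_of_file(path) == expected_digest
```

Storing the digest alongside the analysis code would let a later run detect whether the published dataset has been revised since the original download.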
Datasets that will be analysed
Notes on visual exploration
Links to dataset webpages
Employee wages by occupation, annual link
- Contains over 4 million rows (observations).
- The CSV file itself is 1 GB.
- There are about 23 features, of which at least 3 (perhaps more) are of no use to the analysis.
Employee wages by industry, annual link
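Given the size noted for the occupation dataset (over 4 million rows, about 1 GB), it may help to stream the file and keep only the needed columns rather than load everything at once. A minimal Python sketch using only the standard library follows. The function names `stream_selected_columns` and `count_rows` are hypothetical, and the sketch assumes a UTF-8 CSV with a header row.

```python
import csv


def stream_selected_columns(path, drop_columns):
    """Yield each row as a dict, skipping the columns in `drop_columns`.

    Rows are read one at a time, so memory use stays roughly constant
    regardless of file size.
    """
    drop = set(drop_columns)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield {name: value for name, value in row.items() if name not in drop}


def count_rows(rows):
    """Count rows from any iterable without holding them in memory."""
    return sum(1 for _ in rows)
```

For example, `count_rows(stream_selected_columns("wages.csv", ["SOME_UNUSED_COLUMN"]))` would check the row count in a single pass. Both the file name and the column name here are placeholders, not names from the actual dataset.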