Libraries needed

library(tidyverse)
library(knitr)
library(readxl)
library(zoo)

Data

data <- read.csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
popdata <- read_xls('../data/PopulationEstimates.xls', skip = 2)

Question 1

Steps

1. See Data tab

2. Making a California Subset

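caldata <- data %>%
  filter(state == 'California') %>%
  # group by county so lag() looks at the same county's previous day
  group_by(county) %>%
  arrange(date, .by_group = TRUE) %>%
  # a county's first report has no previous day, so its prior count is taken as 0
  mutate(newcases = cases - lag(cases, default = 0)) %>%
  ungroup()

A short pipeline that is easy to follow. Simply put, I took the data, filtered to California, grouped by county and ordered each county's rows by date, created the new cases variable as the day-over-day change in the cumulative count, and ungrouped the whole thing for later analysis. Grouping by county rather than by date matters here: lag() has to look back to the same county's previous day, not to a different county on the same day.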
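As a sanity check on the new cases calculation, here is a minimal sketch on made-up data. The counties "A" and "B" and their counts are invented for illustration and do not come from the NYT file. It shows what the lagged difference returns when rows are grouped by county and ordered by date.

```r
library(dplyr)

# Made-up cumulative counts for two hypothetical counties
toy <- tibble(
  date   = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02",
                     "2020-03-02", "2020-03-03", "2020-03-03")),
  county = c("A", "B", "A", "B", "A", "B"),
  cases  = c(1, 10, 3, 10, 6, 15)
)

toy_new <- toy %>%
  group_by(county) %>%                  # one series per county
  arrange(date, .by_group = TRUE) %>%   # oldest to newest within each county
  # assumption: the count before a county's first report is 0
  mutate(newcases = cases - lag(cases, default = 0)) %>%
  ungroup()

toy_new
```

Each county's new cases add back up to its latest cumulative count, which is a quick way to confirm that the grouping is doing what is intended.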

3. Generating Two Tables

table1 <- caldata %>% 
  group_by(county) %>% 
  summarise(cases=sum(cases)) %>% 
  arrange(-cases) %>% 
  head(5)

table2 <- caldata %>% 
  group_by(county) %>% 
  summarise(newcases=sum(newcases)) %>% 
  arrange(-newcases) %>% 
  head(5)

tables <- kable(table1, caption = 'Top 5 Case Counts by County', col.names = c("County", "Cases"))

tables
Top 5 Case Counts by County
County            Cases
Los Angeles       24790654
Riverside         5010717
Orange            4599288
San Bernardino    4265567
San Diego         3939250
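One thing to keep in mind when reading these totals: the cases column in the NYT file is a running (cumulative) count, so summing it over every date adds the same cases many times. A possible alternative, sketched here on made-up data with hypothetical counties "A" and "B", is to rank counties by their value on the most recent date. This is only a sketch of that idea, not the table produced above.

```r
library(dplyr)
library(knitr)

# Made-up cumulative counts for two hypothetical counties
toy <- tibble(
  date   = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02")),
  county = c("A", "A", "B", "B"),
  cases  = c(5, 9, 7, 8)
)

# Ranking by the sum over all dates (the approach used for table1)
summed <- toy %>%
  group_by(county) %>%
  summarise(cases = sum(cases)) %>%
  arrange(desc(cases))

# Ranking by the cumulative count on the most recent date
latest <- toy %>%
  filter(date == max(date)) %>%
  select(county, cases) %>%
  arrange(desc(cases)) %>%
  head(5)

kable(latest, caption = "Cumulative cases on the latest date (toy data)",
      col.names = c("County", "Cases"))
```

In this toy example the two rankings disagree: "B" comes first when summing over dates, while "A" has more cases on the latest date.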

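4. & 5. See Data tab

Here the first two rows had to be skipped (skip = 2) because the spreadsheet starts with two "title" rows above the real column names. Without the skip, read_xls would treat the first title row as the header and the data would come into R misaligned.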
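To illustrate what skipping the title rows does, here is a small sketch that uses base read.csv on inline text instead of the actual .xls file, since a spreadsheet cannot be embedded here. The file contents are made up, with the column names chosen only to resemble a population table.

```r
# Made-up file contents: two title rows, then the real header and data
raw_text <- "Population estimates (made-up title row)
Source: hypothetical example
FIPStxt,State,Area_Name
06037,CA,Los Angeles County
06073,CA,San Diego County"

# skip = 2 drops the two title rows so the third line becomes the header;
# colClasses = "character" keeps the leading zero in the FIPS code
pop_toy <- read.csv(text = raw_text, skip = 2, colClasses = "character")

names(pop_toy)
pop_toy
```

The same skip argument is what read_xls receives in the Data section.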

6. Exploring the Data

Using basic inspection functions we can determine which fields we want to join on. For example, both caldata and popdata have a State field. Both data sets also have a FIPS code, though not in exactly the same form: popdata refers to it as FIPStxt, while caldata refers to it as just FIPS and stores it without the leading 0, so some codes appear as only 4 digits. We also know that popdata has 3,273 entries and 165 different variables describing each entry.
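The inspection and the FIPS mismatch described above can be sketched on made-up data. Everything here is a stand-in: pop_toy and cal_toy are hypothetical miniatures of popdata and caldata, and the pop column is invented. names() and dim() are examples of the kind of basic functions meant, and sprintf() is one possible way to restore the leading zero before joining.

```r
library(dplyr)

# Hypothetical miniature of popdata: FIPS stored as 5-character text
pop_toy <- tibble(
  FIPStxt = c("06037", "06073"),
  State   = c("CA", "CA"),
  pop     = c(100, 50)          # invented population values
)

# Hypothetical miniature of caldata: FIPS read in as a number, leading 0 lost
cal_toy <- tibble(
  FIPS   = c(6037L, 6073L),
  county = c("Los Angeles", "San Diego"),
  cases  = c(20, 5)             # invented case counts
)

names(pop_toy)   # column names
dim(pop_toy)     # number of rows and columns

cal_joined <- cal_toy %>%
  mutate(FIPStxt = sprintf("%05d", FIPS)) %>%  # pad back to 5 digits
  inner_join(pop_toy, by = "FIPStxt")

cal_joined
```

Padding the numeric code to five characters gives both tables a key in the same format, so the join matches each county to its population row.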