SF Demographic Data Presented Effectively
San Francisco, please present your race and ethnicity data effectively.
From San Francisco’s dashboard presentation on race and ethnicity, you cannot quickly tell which races and ethnicities are most affected by this crisis. The San Francisco graph has a flaw that is widespread in data presentation and almost ubiquitous across COVID-19 dashboards. The 22% of COVID-19 cases shown as Hispanic in San Francisco’s chart only becomes alarming once you know that only 15% of the San Francisco population is Hispanic, or that non-Hispanic Whites outnumber Hispanics by almost 3 to 1. In geek speak, San Francisco hasn’t normalized its data.
Hispanics are 3 to 4 times as likely to get COVID-19 as Whites and Asians. It also looks like Pacific Islanders could be 5 to 8 times as likely to get COVID-19 as Whites and Asians.
Which races and ethnicities are at risk?
In my presentation I normalized San Francisco’s data by the prevalence of each race and ethnicity in San Francisco’s whole population. From my presentation you can easily see that Hispanics, Pacific Islanders, and other races are most at risk from COVID-19. Once you understand how I created the red error bars, you’ll know why I also think Blacks should be considered a race at risk for COVID-19. You can also see that Whites and Asians are much less likely than average to have or have had COVID-19.
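As a rough illustration of what normalizing means here, the sketch below divides a group's share of cases by its share of the population, using only the two Hispanic figures quoted in this post. The function name `representation_ratio` is made up for this example, and this is one plausible way to normalize, not necessarily the exact calculation behind the chart.

```python
# Figures quoted in this post (rounded percentages).
hispanic_case_share = 22.0        # % of reported COVID-19 cases
hispanic_population_share = 15.0  # % of San Francisco's population


def representation_ratio(case_share: float, population_share: float) -> float:
    """Return the case share divided by the population share.

    A value above 1 means the group has more cases than its share of
    the population would predict; a value below 1 means fewer.
    """
    if population_share <= 0:
        raise ValueError("population_share must be positive")
    return case_share / population_share


ratio = representation_ratio(hispanic_case_share, hispanic_population_share)
print(f"Hispanic cases vs. population share: {ratio:.2f}x")
```

This ratio compares a group with the citywide average. Comparing two groups with each other means dividing one group's ratio by the other's.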
Error bars are a sign of quality.
Error bars distinguish what you know from what you don’t know. If someone has put error bars on a chart or graph, then you know that they have at least considered what the data cannot tell them. In my case I looked at the huge 32% of cases with unknown race / ethnicity and asked myself how that might affect the results of my chart. I took these unknown cases into account in a simplistic way.
My baseline calculation (the blue bars) uses San Francisco’s raw percentages (i.e. 22% for Hispanics). This baseline assumes that none of the unknown cases fell into any of the well-identified categories (i.e. none of the unknown individuals were Hispanic).
The uncertainty calculation (the red bar) takes the unknown individuals and assigns them proportionally to all the well-defined races and ethnicities. This eliminates the unknown category and increases the percentages in all the other categories, because these unknown cases must have some race or ethnicity. This process raises the Hispanic percentage from 22% to 33%.
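The proportional reallocation can be sketched in a few lines. This is a minimal example using the rounded figures from this post (22% Hispanic, 32% unknown). The helper name `reallocate_unknown` is hypothetical.

```python
def reallocate_unknown(known_share: float, unknown_share: float) -> float:
    """Spread the unknown cases proportionally over the known categories.

    Both arguments are percentages of all cases. Dividing by the known
    fraction rescales every known category so they sum to 100% again.
    """
    if not 0 <= unknown_share < 100:
        raise ValueError("unknown_share must be in [0, 100)")
    return known_share / (1 - unknown_share / 100)


baseline = 22.0  # Hispanic share of all cases (blue bar)
unknown = 32.0   # share of cases with unknown race / ethnicity
upper = reallocate_unknown(baseline, unknown)  # red bar
print(f"baseline {baseline:.0f}%, with unknowns reallocated {upper:.1f}%")
```

With these rounded inputs the result is a little over 32%, close to the 33% quoted above. The small gap presumably comes from rounding in the published percentages.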
Where my analysis is weak.
One thing my analysis doesn’t do is take into account the sample size, meaning the number of individuals in each category. There is only one individual in all of San Francisco who has COVID-19 and has identified as Native American. The same problem applies to the multi-race / multi-ethnic category, which has only 7 individuals in it, and to a lesser extent to the Pacific Islander category, which has only 13. These small sample sizes give us more uncertainty. Properly taking this uncertainty into account would give my chart much bigger error bars on the right than on the left, where there are hundreds of individuals in each category.
I didn’t have the time or bandwidth to calculate these error bars properly, but I did want to call out this issue.
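To get a feel for how much the small categories matter, here is a rough sketch that treats each case count as a Poisson count, whose relative one-sigma uncertainty is about 1/sqrt(n). The Poisson assumption is mine and is not the method behind the chart. The count of 300 is a made-up stand-in for the categories with hundreds of cases.

```python
import math

# Case counts mentioned in this post; 300 is a hypothetical stand-in
# for the categories that have "hundreds" of cases.
counts = {
    "Native American": 1,
    "Multi-race / multi-ethnic": 7,
    "Pacific Islander": 13,
    "Hypothetical large category": 300,
}


def relative_uncertainty(count: int) -> float:
    """Approximate one-sigma relative error of a Poisson count: 1/sqrt(n)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return 1 / math.sqrt(count)


for name, n in counts.items():
    print(f"{name}: {n} cases, roughly +/-{relative_uncertainty(n):.0%}")
```

Under this assumption a category with a single case is uncertain by about 100%, one with 13 cases by a bit under 30%, and one with a few hundred cases by only a few percent. For very small counts the approximation is crude, and an exact interval would be a better choice.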
San Francisco deserves recognition for having demographic data available.
Obviously COVID-19 is affecting some races very differently from others. San Francisco deserves kudos for tracking the case numbers in this manner. It is really sad that very few other counties are doing the same demographic tracking.
Please, San Francisco, make your dashboard more compelling. More people visit your dashboard than will ever see this post.