The State of Data Science (Part 2)

Kaggle has released the data for their third annual Machine Learning and Data Science Survey. I’ve only recently joined the Kaggle platform as I’ve transitioned from academia to private industry, so this seems to be an excellent opportunity to explore the backgrounds of my new data science peers.

This is the second blog post exploring this data.

Part 1 explored this year’s survey results.

Part 2, this post, brings in survey data from the first two years of this annual survey to investigate how the field has changed over the last three years in the United States.

Note: Because I am based in the United States, as is most of the data science community with which I interact regularly in person or on social media, this analysis is limited to the 3085 survey respondents living in the United States.

Let’s get started!

Demographics

The data field today is as disproportionately male as it’s ever been.

Perhaps unsurprisingly, given how disproportionately male we found the field to be in Part 1, there appears to be no trend towards increasing gender diversity among data professionals.

No clear trend in age composition

Survey data doesn’t seem to show a clear trend in the age composition of data professionals. This isn’t to say that the composition is stable or unchanging. As always, it should be noted that this survey is not a random sample of data professionals, but a voluntary response survey of Kaggle users, so the composition may vary widely based on how Kaggle chooses to promote the survey.

The increasingly educated data workforce

The plot above shows the highest educational attainment of respondents to all three years’ surveys. The number of data professionals holding a Master’s degree shocked me in Part 1, but the data show that this has been an ongoing trend, while the percentage of data professionals holding professional degrees, or only some college education or a high school diploma, is near zero. Whether this trend towards a more educated data workforce is due to former Bachelor’s holders seeking and attaining higher degrees or to new hires disproportionately coming out of grad school is unclear.

Titles and Compensation

The ascendency of the “data scientist” title

The proportion of data professionals on Kaggle with the job title “data scientist” has increased 30% relative to 2017: from 30% to 40%. This does not appear to be a simple retitling of former “data analysts”, whose share of employed Kaggle respondents has also increased.

The greatest declines in relative share of the data workforce are among engineers (which includes every title containing “engineer”, from data engineers to SWEs) and researchers (which includes every title containing “research”, with the exception of research assistants, who were excluded because these positions are typically not careers).

More difficult to see here is the relative decline in the “Statistician” title, which started at a barely noticeable 3.3% and has since fallen to 2.4%.
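Since this section mixes shares (percent of respondents) with relative changes (percent of a percent), a quick sketch of the arithmetic may help. It uses only the rounded shares quoted above; `relative_change` is a helper name made up for this illustration. With the rounded 30% and 40%, the rise in the “data scientist” share works out to about a third, so the 30% figure quoted above was presumably computed from unrounded shares.

```r
# Relative change between two shares, as a fraction of the starting share
# (hypothetical helper, not part of the original analysis)
relative_change <- function(old, new) {
  (new - old) / old
}

# Rounded shares quoted in the text, in percent
data_scientist <- relative_change(old = 30, new = 40)
statistician <- relative_change(old = 3.3, new = 2.4)

# Report both as whole percentages
print(round(100 * c(data_scientist = data_scientist, statistician = statistician)))
```

By this measure the “Statistician” title lost roughly a quarter of its already small share.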

Data scientists’ compensation has increased even as the title has become more widespread

The above plot reports the typical (median) compensation of respondents with each job title. The responses are an ordered factor, so the statistic shown is an idiomatic median for ordinal data: sort by the ordered factor and take the middle value. When the vector has even length and the two middle values fall in different categories (e.g. $80,000-89,999 and $90,000-99,999), the lower category is used.

The function for this implementation is below, with credit to Hong Ooi and Richie Cotton from StackOverflow.

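# Median of an ordered factor, returned as a level of that factor
median.ordered <- function(x) {
  levs <- levels(x)
  # Median of the underlying integer codes, ignoring missing responses
  m <- median(as.integer(x), na.rm = TRUE)
  # A half-integer median means the two middle values differ: take the lower level
  m <- floor(m)
  ordered(m, labels = levs, levels = seq_along(levs))
}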
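To see the tie-breaking rule in action, here is a small usage sketch. The four responses and the bracket labels are made up for illustration and are not survey data. The function is repeated in compact form so the snippet runs on its own.

```r
# Same logic as the function above, repeated so this snippet is self-contained
median.ordered <- function(x) {
  levs <- levels(x)
  m <- floor(median(as.integer(x), na.rm = TRUE))
  ordered(m, labels = levs, levels = seq_along(levs))
}

# Hypothetical salary brackets, lowest to highest (illustrative only)
brackets <- c("$70,000-79,999", "$80,000-89,999", "$90,000-99,999", "$100,000-124,999")

# Four made-up responses, so the two middle values fall in different brackets
pay <- factor(brackets, levels = brackets, ordered = TRUE)

# median() dispatches to median.ordered because `pay` has class "ordered"
typical <- median(pay)
print(typical)
```

With an even number of responses split across $80,000-89,999 and $90,000-99,999 in the middle, the result is the lower bracket, $80,000-89,999.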

On the whole, the wages of data professionals appear to be on the rise. Median compensation for every job title except Engineers appears to be higher in 2019 than it was in 2017, and median Engineer compensation has remained steady at a very respectable $100,000 to $125,000. Data scientists in particular have broken away and now typically make $125,000 to $150,000.

Languages

Python’s meteoric rise

About two-thirds of data professionals who program today are using Python, a 50% increase relative to the proportion using Python in 2017. Likewise, SQL use has increased 30% over the same period, while R use has remained fairly stable at about 33% and all other languages have held steady at about 10%.
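For a sense of the starting point, the two figures given for Python can be turned around to back out the implied 2017 share. This is only a rough sketch: both inputs are the rounded values from the paragraph above, and the variable names are mine.

```r
# Share of programming respondents using Python today ("about two-thirds")
python_now <- 2 / 3

# Stated relative increase since 2017
python_growth <- 0.50

# Implied 2017 share: today's share divided by (1 + relative increase)
python_2017 <- python_now / (1 + python_growth)
print(round(100 * python_2017, 1))
```

Under those rounded inputs, somewhat under half of programming respondents were using Python in 2017.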

Wrapping up

So there you have it.

As far as I can see, these are the three big takeaways from Kaggle’s three years of survey results:

  • Unfortunately, the data workforce does not appear to be becoming any more gender diverse than it was in 2017.

  • “Data scientist” as a title has become extremely widespread and, in spite of this proliferation, is as highly paid as ever.

  • Python and SQL are becoming universal, but not at the expense of other languages.

David Nield
Data Analyst

My research interests include design-based inference, network analysis, and data visualization.
