Data science and women's cycling (and interactive graphs!)

Hi! I’m a third-time Zwift Academy participant, excited to get ready for another round in this wonderful community. :blush: I’m also a self-professed data nerd. I recently analyzed the demographics of the elite women’s peloton using data I found on the UCI website and wrote it up as a blog post, complete with interactive graphs, that might be of interest to other people here. :chart_with_upwards_trend::bar_chart: Check it out!
https://ameliabarber.netlify.com/2019/08/04/data-sciencing-womens-cycling-pt-1/

(Unfortunately, because of the interactive plots, I don’t think it will display super well on mobile devices. Sorry in advance for that.)

Update: I did a follow-up post looking at the relationship between rider ranking, age, and number of years as a pro, which can be found here.

Looking forward to another great year in the supportive community that is the Zwift Women’s Academy

Amelia


Wow that blog post looks epic!! Great effort. My first thought was “did you use R?” !

I’m just completing the Microsoft Data Science Specialisation at edX. I took the Python track rather than R (because I want to progress to TensorFlow). Did you find R easy to learn?

You’ve posted the mean number of years. Do you know the median for “How many years do women stay in the peloton?” It must be close to 1!

Ride On!


Hi Amelia - I really enjoyed your post. It’s so hard to find consolidated info on women’s cycling, so thanks for putting it together.

Good luck with the rest of the academy!


Thanks for the kind words and sorry for the delayed response! I guess I have notifications turned off :thinking:

The median number of years a female pro stays in the peloton is 2, which I still found surprisingly low. I also have the data for male riders on UCI teams of all tiers (World Tour, Pro Continental, and Continental) during the same time period. I took a quick look and found that the median for male pros is also 2 seasons, but the mean is higher at 3.39 seasons.
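For anyone wondering how a median of 2 can sit next to a noticeably higher mean, here is a minimal Python sketch. The career lengths in it are made up purely for illustration (they are not the UCI data), and `summarize` is a hypothetical helper name. A few long careers pull the mean up while the median stays put.

```python
from statistics import mean, median

# Hypothetical career lengths in seasons.
# These numbers are invented for illustration; they are NOT the UCI data.
career_seasons = [1, 1, 1, 2, 2, 2, 3, 4, 5, 9]


def summarize(seasons):
    """Return (median, mean) for a list of career lengths."""
    return median(seasons), mean(seasons)


med, avg = summarize(career_seasons)
print(f"median = {med}, mean = {avg}")
```

With a right-skewed list like this one, most riders have short careers, so the middle value is low, but the handful of long careers still raises the average.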

I am relatively new to data science having just started learning R in December, but found it very intuitive thanks to the Tidyverse packages. I’m trying to teach myself the Python data science suite in my free time, but I keep shoving it to the back burner to work on projects like this :wink:.

Thanks Rebecca :blush:

I hope to do some more posts regarding women’s pro cycling, so stay tuned!

Ok thanks for that. Could you tell me roughly how long the visualisations took to create with R? I’m assuming the ETL was all straightforward because the data could be downloaded from the UCI website. The data science truism is that data preparation takes 80% of the total time. So if that part is taken care of because good-quality data is supplied, how long do R visualisations like yours take to create? Python is hard work (but I’ve heard R is tougher!)!

The data from the UCI was very clean and required only minimal wrangling and error correction on my part, which was a pleasant surprise. As for how long the visualizations took: are we including exploring the data and messing around with multiple options to find the best way to visualize a particular figure, or just writing the code for the finished product? Because those are two very different things, at least for me. I’m not even sure I can give you an accurate estimate of how much time the post took. It was spread in 1-2 hour increments in my free time over two weeks, but let’s say upwards of 15 hours’ worth of work from downloading the data to hitting publish?

Some visualizations, like tracking the number of cyclists over the last 15 years, took only a few minutes to create. Others, like the interactive map, were a much bigger time investment because I didn’t have much prior experience with geospatial visualization and the UCI used a different country abbreviation system than the base map that’s easily accessible from ggplot (my plotting package). This was also my first time using plotly, which can carry over a lot of the R/ggplot syntax and features, but not all of it, so some of the time was spent learning how to add annotations and control aesthetics in plotly.
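Since Python came up in this thread, here is a rough sketch of that kind of country-code cleanup in Python. It assumes the rider data uses IOC-style three-letter codes and the base map uses ISO 3166-1 alpha-3 codes; I haven’t confirmed that these are the two systems involved. The lookup table is deliberately incomplete, and `to_iso3` and `unmatched` are hypothetical helper names.

```python
# Assumption: source data uses IOC-style codes, base map uses ISO 3166-1 alpha-3.
# Only codes that differ between the two systems need an entry.
# This table is a small illustrative subset, not a complete mapping.
IOC_TO_ISO3 = {
    "GER": "DEU",  # Germany
    "NED": "NLD",  # Netherlands
    "SUI": "CHE",  # Switzerland
    "DEN": "DNK",  # Denmark
}


def to_iso3(code):
    """Translate a country code to ISO alpha-3, passing unknown codes through."""
    code = code.strip().upper()
    return IOC_TO_ISO3.get(code, code)


def unmatched(codes, map_codes):
    """Return translated codes that still have no match on the base map."""
    return sorted({to_iso3(c) for c in codes} - set(map_codes))


# Hypothetical example: "XXX" is flagged because the map has no such code.
print(unmatched(["GER", "NED", "USA", "XXX"], {"DEU", "NLD", "USA"}))
```

Checking for leftover unmatched codes before plotting is the useful part: it shows which countries would silently drop off the map.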

I actually think visualizations in R are pretty easy (and highly customizable) thanks to ggplot and its ability to build plots in layers, but I’m biased because that’s the system I have spent the most time with. In the end, you can probably create the same figure in similar time with either R or Python if you know what you are doing.

Ok, thanks for letting me know the time involved in using R for these visualisations, Amelia. Sounds like it takes about the same amount of time as Python. I don’t have any coding-language-buyers-remorse now! Ride On!