
Prepare to Use the Divvy Dataset

Andrew Luyt
Last updated: Friday August 13, 2021

Summary

Data Rights

The data used here is available under this licence. The agreement grants us the right to analyze this dataset independently.

Dataset summary

Let’s examine the most recent file from the raw dataset.

names(df)  # What variables are in the dataset?
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
skimr::skim(df)
   
Data summary

Name: df
Number of rows: 729595
Number of columns: 13

Column type frequency:
character: 5
factor: 2
numeric: 4
POSIXct: 2

Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 729595 0
start_station_name 80093 0.89 10 53 0 689 0
start_station_id 80093 0.89 3 35 0 689 0
end_station_name 86387 0.88 10 53 0 690 0
end_station_id 86387 0.88 3 35 0 690 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
rideable_type 0 1 FALSE 3 cla: 435020, ele: 242859, doc: 51716
member_casual 0 1 FALSE 2 cas: 370681, mem: 358914

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_lat 0 1 41.90 0.04 41.64 41.88 41.90 41.93 42.07 ▁▁▇▇▁
start_lng 0 1 -87.64 0.03 -87.78 -87.66 -87.64 -87.63 -87.52 ▁▁▇▂▁
end_lat 717 1 41.90 0.04 41.51 41.88 41.90 41.93 42.08 ▁▁▁▇▁
end_lng 717 1 -87.64 0.03 -87.86 -87.66 -87.64 -87.63 -87.49 ▁▁▇▆▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2021-06-01 00:00:38 2021-06-30 23:59:59 2021-06-14 19:46:47 589805
ended_at 0 1 2021-06-01 00:06:22 2021-07-13 22:51:35 2021-06-14 20:13:55 589069

Observations & Sanity Checks

Can we infer the station names based on coordinates?

About 11% of the rides have no start station name and about 12% have no end station name (see the complete_rate column above). These rides will be dropped from the analysis in a later step. If it later became necessary to impute the names and keep those rides, could we infer the stations from the recorded longitude and latitude?

Summary: Yes, looks like it, but we won’t because it isn’t necessary now.
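As a rough sketch of how the share of rides without station names could be checked, and how those rides might be dropped later, here is a small base R example. The `rides` data frame is a made-up stand-in for `df`, and the names `missing_share`, `missing_either` and `complete_rides` are only illustrative, not taken from the original analysis.

```r
# Toy stand-in for `df` (made-up rows; only the two name columns are included).
rides <- data.frame(
  start_station_name = c("Adler Planetarium", NA, "63rd St Beach", "Adler Planetarium"),
  end_station_name   = c(NA, NA, "Adler Planetarium", "63rd St Beach"),
  stringsAsFactors = FALSE
)

# Share of rides missing each station name (the complement of skimr's complete_rate).
missing_share <- colMeans(is.na(rides[, c("start_station_name", "end_station_name")]))

# Share of rides missing the start name, the end name, or both.
missing_either <- mean(is.na(rides$start_station_name) | is.na(rides$end_station_name))

# Keep only rides where both station names are present.
complete_rides <- rides[!is.na(rides$start_station_name) &
                          !is.na(rides$end_station_name), ]
```

On the real data, `missing_share` should agree with one minus the `complete_rate` values in the skim output above.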

Let’s look at the mean latitude and longitude for each station name. In the tables below, neighbouring stations differ in the third or fourth decimal place of their coordinates, while the standard deviation of the coordinates recorded for a single station is on the order of 1e-05 degrees, i.e. the variation is confined to the fifth decimal place. At first glance, then, the stations can indeed be clearly identified by their map coordinates.
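A per-station summary of this kind could be computed roughly as follows. This is a base R sketch, not necessarily how the table in this report was produced. The `rides` data frame is a toy stand-in for `df` with made-up longitudes, and the output column names simply mirror the table (`mean_x` and `sd_x` for longitude, `mean_y` and `sd_y` for latitude).

```r
# Toy stand-in for `df`; longitudes are made-up values for illustration.
rides <- data.frame(
  start_station_name = c("Aberdeen St & Jackson Blvd", "Aberdeen St & Jackson Blvd",
                         "Aberdeen St & Monroe St", "Aberdeen St & Monroe St", NA),
  start_lat = c(41.87773, 41.87775, 41.88037, 41.88056, 41.90000),
  start_lng = c(-87.65480, -87.65484, -87.65554, -87.65550, -87.64000),
  stringsAsFactors = FALSE
)

# Only rides with a station name can contribute to a station's summary.
named <- rides[!is.na(rides$start_station_name), ]

# tapply() groups by station name; every call returns stations in the same order.
mean_x <- tapply(named$start_lng, named$start_station_name, mean)
mean_y <- tapply(named$start_lat, named$start_station_name, mean)
sd_x   <- tapply(named$start_lng, named$start_station_name, sd)
sd_y   <- tapply(named$start_lat, named$start_station_name, sd)

station_summary <- data.frame(
  start_station_name = names(mean_x),
  mean_x = as.vector(mean_x),
  mean_y = as.vector(mean_y),
  sd_x   = as.vector(sd_x),
  sd_y   = as.vector(sd_y),
  stringsAsFactors = FALSE
)
```

If the standard deviations are small compared with the gaps between the station means, the coordinates separate the stations cleanly.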

Later, if we wished to pursue this imputation of data, we could do it with a few simple methods:

For the purpose of finding differences between user types, a prediction error such as placing a trip start at a station a few blocks away should not bias the analysis much, and imputing the names would give us roughly 10% more data to work with.
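One simple way to do such an imputation, shown here only as a sketch and not as the method this analysis uses, is to assign each unnamed ride to the station whose mean coordinates are closest. The `stations` values below are copied from the per-station table in this section. The `nearest_station()` helper and the `unnamed` rides are hypothetical and exist only for illustration.

```r
# Station centroids, copied from the per-station table in this section.
stations <- data.frame(
  start_station_name = c("Aberdeen St & Jackson Blvd", "Aberdeen St & Monroe St",
                         "Adler Planetarium"),
  mean_x = c(-87.65480, -87.65554, -87.60728),
  mean_y = c(41.87773, 41.88042, 41.86610),
  stringsAsFactors = FALSE
)

# Hypothetical helper: name of the station whose centroid is nearest to (lng, lat).
# Distance is squared Euclidean distance in degrees, with longitude scaled by
# cos(latitude) so that east-west and north-south offsets are roughly comparable.
nearest_station <- function(lng, lat, stations) {
  dx <- (stations$mean_x - lng) * cos(lat * pi / 180)
  dy <- stations$mean_y - lat
  stations$start_station_name[which.min(dx^2 + dy^2)]
}

# Made-up rides that have coordinates but no station name.
unnamed <- data.frame(
  start_lng = c(-87.65480, -87.60730),
  start_lat = c(41.87770, 41.86620)
)

# Apply the helper row by row.
unnamed$imputed_station <- mapply(
  nearest_station, unnamed$start_lng, unnamed$start_lat,
  MoreArgs = list(stations = stations)
)
```

A nearest-centroid rule always returns some station, however far away it is, so a real implementation would probably also want a maximum distance beyond which the name is left missing.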

start_station_name mean_x mean_y sd_x sd_y
2112 W Peterson Ave -87.68359 41.99118 1.88e-05 1.96e-05
63rd St Beach -87.57626 41.78095 9.22e-05 5.56e-05
900 W Harrison St -87.64980 41.87475 2.34e-05 1.25e-05
Aberdeen St & Jackson Blvd -87.65480 41.87773 3.87e-05 2.77e-05
Aberdeen St & Monroe St -87.65554 41.88042 5.38e-05 3.09e-05
Aberdeen St & Randolph St -87.65427 41.88412 2.34e-05 1.62e-05
Ada St & 113th St -87.65543 41.68756 5.92e-05 1.01e-05
Ada St & Washington Blvd -87.66120 41.88283 2.55e-05 2.12e-05
Adler Planetarium -87.60728 41.86610 3.50e-05 2.22e-05
Albany Ave & 26th St -87.70203 41.84449 3.17e-05 3.45e-05

start_station_name start_lat
Aberdeen St & Monroe St 41.88037
Aberdeen St & Jackson Blvd 41.87773
Aberdeen St & Jackson Blvd 41.87775
Aberdeen St & Jackson Blvd 41.87772
Aberdeen St & Jackson Blvd 41.87775
Aberdeen St & Jackson Blvd 41.87770
Aberdeen St & Jackson Blvd 41.87770
Aberdeen St & Monroe St 41.88056
Aberdeen St & Jackson Blvd 41.87783
Aberdeen St & Monroe St 41.88052