I wrote this report for a class, and hadn’t really put it online before, but I feel like it’s pretty interesting nonetheless. I just quickly pasted the text and charts in. For those not interested in reading the full report, here is an excerpt from the conclusion that shows the main takeaway:
Interestingly, income was not among the statistically significant predictors. The strongest predictor, meanwhile, was the percentage of the population that is white. This is true even when accounting for the facts that: whites have a higher income than non-whites; whites live in areas with a higher population; and that there is no obvious reason for race to be a factor at play. …
We have shown that the racial demographics of a bike station’s Voronoi cell [surrounding area] have a statistically significant impact on its ridership, which was the goal of the study.
Citi Bike is New York City’s public bicycle sharing system. The system began operation in 2013, and grew rapidly in size and usage. Originally serving only Lower and Midtown Manhattan, the system was expanded in 2015 when stations were added in some adjacent parts of Jersey City, Upper Manhattan and Brooklyn. As of June 2016, the bikeshare’s implementation in these new areas is only beginning to show results.
This study will determine if the racial demographics around a bike station have a significant effect on usage, even when population, age and income are accounted for.
When looking at a use frequency map of Citi Bike stations (Map 1), we notice areas which have a visibly lower number of uses per station. This pattern is especially interesting in Lower and Midtown Manhattan.
Stations in Brooklyn, Jersey City and Upper Manhattan have lower ridership due to their recent implementation. However, the reasons for diminished numbers on the Lower East Side are not as clear. These areas, while having a distinct set of racial demographics, also feature characteristics such as lower income and slightly higher median age. Lower incomes may mean that these people have jobs in areas outside Manhattan, making a bicycle commute unfeasible. A higher median age, meanwhile, may be associated with decreased physical health and vigor; this would make biking unsuitable for transport.
The population surrounding a station also impacts its usage, as can be seen in Times Square or the Financial District. This study will attempt to disentangle race from the other possible explanatory variables, and will determine if the racial composition of a bike station’s surrounding area can explain variations in usage when population, income and age have been accounted for.
Variables and Initial Assumptions
Population, income and age: why these variables?
A review of the literature in Transport Reviews notes that bikeshare users look for convenient travel, cost savings, proximity and matching purpose. Convenience, cost savings and purpose of bike use are not variables that are currently directly measured in ways useful to our study.
Nonetheless, convenience is frequently cited by policy experts as the most crucial factor in the success of a bikeshare program. According to a publication by the National Association of City Transportation Officials, people are generally not willing to walk more than 1,000 feet to use a public bicycle. This means that bikeshare users must either live within 1,000 feet of a station, or use other transport to get there. 84% of all rides in 2015 were by Citi Bike annual members, who overwhelmingly are residents of Lower and Midtown Manhattan. Therefore we can estimate the population that is closer to a given bike station than to any other (its Voronoi cell) and within 1,000 feet of it, and use that as a possible explanatory variable.
It is also reasonable to assume that cost savings and purpose vary with income. Lower-income people have a greater incentive to save money on transportation. On the other hand, lower-income residents of Lower and Midtown Manhattan may have long commutes which make bikeshare impractical. Either way income is a relevant variable, for which there exists block group-level census data. For the purposes of the study, we must assume that the internal variation in income within a block group is not significant, and that each census block within a group can be represented as having the estimated median income of the block group.
Based on the literature, age is also a relevant indicator. The convenience of riding a bike depends directly on a person’s physical ability to do so, which on average varies fairly consistently with age. Also, age has a significant influence on cultural or ideological values, which may affect a person’s desire to use a bikeshare. Median age data exists at the precise census block level, making it well suited for this study.
How race could play a role
Unlike age and income, race has no apparent connection to bikeshare usage. It is possible that a neighborhood’s racial composition can affect who travels in and out, due to self-segregation. Race may also be indicative of cultural values which, whether internally or externally imposed, may have an effect on ridership numbers. This study will try to determine whether or not an area’s racial composition has a significant impact on bikeshare usage when other variables are accounted for.
Procedure, Part I (Setup)
Overall, the procedure consists of data preparation, some field calculation, and an Ordinary Least Squares (OLS) regression. Some parts of the code are omitted for brevity. The full Python script can be found in the appendix.
Initial data preparation: U.S. Census files
Since the U.S. Census data came in the form of CSV files, the pandas, csv and numpy packages proved very useful. First we import the packages and read the CSV files, and define some local variables for arcpy to use later.
Since the race and age data are on the same (block) level and have common IDs, it is easy to merge the two data frames:
    # Merge block-level data
    race_and_age = pd.merge(race, age, on='GEO.id2', how='inner')
As mentioned previously, income data is only available on the block group level, so it must be extrapolated to the census blocks. This involves some clever use of sliced ID strings.
    # First, make a column with census block group IDs
    race_and_age['key'] = race_and_age['GEO.display-label_x'].str[12:]
    race_and_age.to_csv('race_and_age.csv', index=False)

    # Second, create the income column based on block group data
    income['key'] = income['GEO.display-label']
    race_age_income = race_and_age.merge(income, on=['key'])
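To see why slicing at position 12 works, here is a minimal sketch on made-up rows. It assumes block labels of the form `Block NNNN, Block Group N, ...`, where the `Block NNNN, ` prefix is exactly 12 characters long. The sample labels, the `Total_pop` and `Median_income` values, and the tract names are illustrative and not taken from the census files.

```python
import pandas as pd

# Hypothetical block-level rows; real labels come from the census CSVs.
blocks = pd.DataFrame({
    "GEO.display-label_x": [
        "Block 1000, Block Group 1, Census Tract 2.01, New York County, New York",
        "Block 1001, Block Group 1, Census Tract 2.01, New York County, New York",
        "Block 2000, Block Group 2, Census Tract 2.01, New York County, New York",
    ],
    "Total_pop": [120, 80, 200],
})

# Hypothetical block-group-level income rows.
income = pd.DataFrame({
    "GEO.display-label": [
        "Block Group 1, Census Tract 2.01, New York County, New York",
        "Block Group 2, Census Tract 2.01, New York County, New York",
    ],
    "Median_income": [52000, 61000],
})

# Dropping the 12-character "Block NNNN, " prefix leaves the block group label.
blocks["key"] = blocks["GEO.display-label_x"].str[12:]
income["key"] = income["GEO.display-label"]

# Every block inherits the median income of its block group.
merged = blocks.merge(income[["key", "Median_income"]], on="key")
print(merged[["Total_pop", "Median_income"]])
```

The slice only works if every block number has four digits; a label with a different prefix length would produce a key that matches nothing and the row would silently drop out of the inner merge.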
To simplify the data tables for future use, we then drop unnecessary columns using the DataFrame.drop method. We also change the data type of some columns where numbers were stored as strings. In addition to cleaning up the columns, a numeric ID column has to be converted to strings, in order to be compatible with ArcGIS’s way of interpreting CSV data types. Due to various glitches and incompatibilities, this is accomplished through a crude workaround, but the result works (see Appendix).
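The cleanup steps might look roughly like the sketch below. The column names and values are hypothetical, and this does not reproduce the workaround from the appendix: casting the ID to text inside pandas says nothing about how ArcGIS later infers types from the CSV.

```python
import pandas as pd

# Hypothetical merged table with an unused column and numbers stored as text.
df = pd.DataFrame({
    "GEO.id2": [360610002011000, 360610002011001],
    "GEO.display-label_y": ["unused", "unused"],
    "Median_income": ["52000", "61000"],
})

# Drop columns that are not needed downstream.
df = df.drop(columns=["GEO.display-label_y"])

# Convert numeric strings to numbers; unparseable values become NaN.
df["Median_income"] = pd.to_numeric(df["Median_income"], errors="coerce")

# Cast the numeric ID to text within pandas.
df["GEO.id2"] = df["GEO.id2"].astype(str)
print(df.dtypes)
```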
Initial data preparation: Citi Bike data
Citi Bike provides an extensive collection of data on their public website. Like the census data, Citi Bike data comes in the form of multiple CSV files. Each CSV file contains every trip made by every user during a given month. Our goal is to get a table that shows every station ID, its coordinates, and how many uses it had in 2015. First we load the twelve individual files. In order to get a dataset for all of 2015, we simply concatenate the files for each month.
    # Join them into one for all of 2015
    months = [jan, feb, mar, apr, may, jun, jul, aug, sep, ocb, nov, dec]
    all_2015 = pd.concat(months)
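The loading step itself is omitted here. One way to read and stack the monthly files without naming each one is sketched below; `load_year` is a hypothetical helper and the `2015MM-trips.csv` file names are an assumption, so the demo writes two tiny stand-in files to a temporary directory instead of using real data.

```python
import glob
import os
import tempfile

import pandas as pd

def load_year(pattern):
    """Read every CSV matching pattern and stack them into one frame."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)

# Demo with two tiny stand-in files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for month, rides in [("01", 2), ("02", 3)]:
        sample = pd.DataFrame({"tripduration": [300] * rides})
        sample.to_csv(os.path.join(tmp, f"2015{month}-trips.csv"), index=False)
    all_2015 = load_year(os.path.join(tmp, "2015*-trips.csv"))

print(len(all_2015))
```

Unlike a plain `pd.concat(months)`, this sketch passes `ignore_index=True`, so the combined frame gets a fresh row index instead of repeating each month's own.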
Each trip has two endpoints, a start station and an end station. In some cases, this may be the same station, which would mean that this station is used two times. To get a list of every endpoint, we first extract the start and end stations separately.
    # Get station IDs and coordinates for starts and ends (selected by column position)
    all_starts = all_2015.iloc[:, [3, 5, 6]]
    all_ends = all_2015.iloc[:, [7, 9, 10]]
Having created two new data frames, we rename the columns so that they match up, and then proceed to concatenate. The end result is a list of every endpoint of every trip made in 2015.
    all_starts.columns = ['station id', 'station latitude', 'station longitude']
    all_ends.columns = ['station id', 'station latitude', 'station longitude']

    # Concatenate
    all_endpoints = pd.concat([all_starts, all_ends])
The final step is to create a frequency table. This is done by adding a column to our new data frame, all_endpoints, which contains a count of how many times the endpoint’s station ID appears in the table. Of course, this means that all the rows for endpoints with a given station ID look identical. To get the final frequency table we want, we simply drop duplicate rows with a single function.
    # Add freq column
    all_endpoints['freq'] = all_endpoints.groupby('station id')['station id'].transform('count')

    # Drop duplicate rows
    stations = all_endpoints.drop_duplicates()
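An equivalent way to get the same table, sketched here on made-up endpoints, is to count rows per station directly with `groupby(...).size()`. The station IDs and coordinates below are invented for illustration.

```python
import pandas as pd

# Hypothetical endpoints: one row per trip start or end.
all_endpoints = pd.DataFrame({
    "station id": [72, 72, 79, 72, 79, 82],
    "station latitude": [40.767, 40.767, 40.719, 40.767, 40.719, 40.711],
    "station longitude": [-73.994, -73.994, -74.007, -73.994, -74.007, -74.000],
})

# Count endpoints per station, keeping the coordinates as grouping keys.
stations = (
    all_endpoints
    .groupby(["station id", "station latitude", "station longitude"])
    .size()
    .reset_index(name="freq")
)
print(stations)
```

Both approaches assume a station's coordinates are recorded identically on every trip; if they drift, the same station ID will appear in more than one row.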
Our Citi Bike data is now ready. Next, we will perform the GIS analysis and regression.
Preparing GIS files for OLS regression
We have already defined the local variables in the very beginning of our script. Now we can use our Citi Bike CSV, census data, and a 2015 TIGER/Line shapefile of Manhattan census blocks for the real spatial analysis.
First we project our Citi Bike stations by converting the latitude and longitude into the proper notation for ArcGIS. Then we clip the stations layer so that we only deal with the stations that have existed since the beginning of 2015. This is done with a polygon that was manually created beforehand. Once the stations layer is finalized, we create Voronoi polygons for each station, clipped to 1,000 feet (Map 2).
As for the census data CSV, we first use a “Table to Table” tool to ensure that the CSV is compatible with ArcGIS. Then we can join the CSV table to the Manhattan census block shapefile’s attribute table. Before we perform the intersect and dissolve to get the variables for each station’s polygon, we have to create a field for the total age and total income within a block:
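The clipped Voronoi cells have a simple interpretation: a location belongs to a station's cell when that station is the nearest one and lies within 1,000 feet. The sketch below illustrates that rule in plain Python on made-up planar coordinates in feet. It is not the ArcGIS workflow used in the study, and `assign_cell` is a hypothetical helper.

```python
import math

MAX_WALK_FT = 1000.0  # walking threshold discussed earlier in the report

def assign_cell(point, stations, max_dist=MAX_WALK_FT):
    """Return the ID of the nearest station within max_dist, or None."""
    best_id, best_dist = None, float("inf")
    for station_id, (sx, sy) in stations.items():
        dist = math.hypot(point[0] - sx, point[1] - sy)
        if dist < best_dist:
            best_id, best_dist = station_id, dist
    return best_id if best_dist <= max_dist else None

# Hypothetical station positions, in feet on a projected plane.
stations = {"A": (0.0, 0.0), "B": (1500.0, 0.0)}

print(assign_cell((600.0, 0.0), stations))     # nearest is A, 600 ft away
print(assign_cell((900.0, 0.0), stations))     # nearest is B, 600 ft away
print(assign_cell((750.0, 2000.0), stations))  # no station within 1,000 ft
```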
    # Calc fields for income and age estimate (population weighted)
    arcpy.AddField_management(
        manh__3_, "total_age", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
    arcpy.AddField_management(
        manh__2_, "total_income", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
    arcpy.CalculateField_management(
        manh__6_, "total_age", "[Total_pop] * [Median_age]", "VB", "")
    arcpy.CalculateField_management(
        manh__8_, "total_income", "[Total_pop] * [Median_income]", "VB", "")
This is so that the age and income for the Voronoi polygons can be estimated by simply adding up these sums, then dividing by how many people live in the buffer. This accounts for the fact that not all census blocks within a station’s buffer will have the same population, so a simple “MEAN” merge rule would not work.
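A small worked example, using invented numbers, shows how the population-weighted estimate differs from a simple mean. Note that a weighted mean of block medians is only an estimate of the cell's true median.

```python
# Hypothetical census blocks falling inside one station's cell:
# (population, median age, median income)
blocks = [
    (100, 30.0, 40000.0),
    (300, 40.0, 80000.0),
]

total_pop = sum(pop for pop, _, _ in blocks)
total_age = sum(pop * age for pop, age, _ in blocks)      # analogous to "total_age"
total_income = sum(pop * inc for pop, _, inc in blocks)   # analogous to "total_income"

# Divide the summed totals by the number of residents in the cell.
weighted_age = total_age / total_pop
weighted_income = total_income / total_pop

# For comparison: the unweighted mean treats both blocks as equally large.
simple_mean_age = sum(age for _, age, _ in blocks) / len(blocks)

print(weighted_age)      # 37.5
print(simple_mean_age)   # 35.0
print(weighted_income)   # 70000.0
```

The larger block pulls the weighted age toward 40, which is what the “MEAN” merge rule would miss.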
Finally we perform an Intersect of the station polygons and census blocks, preserving all attributes. We then dissolve the result by station ID, so that we end up with single Voronoi polygons for each station, with all the needed attributes. After using Calculate Field to get the desired columns, we are able to perform an OLS regression.
Procedure, Part II (Regression)
The Ordinary Least Squares regression is carried out in a stepwise manner to determine which combination of variables best explains station usage. This is slightly different from the Exploratory Regression function that comes with ArcGIS by default, since we have more control over which combinations of variables we would like to try. In addition, we get to see every regression report, which helps diagnose possible problems with the data.
Before we begin identifying the best model stepwise, we must make sure that there are no major biases, and that influential observations are accounted for. First we perform a regression without removing any outliers, or transforming the response variable in any way. The first model is:
Where White is the estimated percentage of the population that is white, Income is the estimated median income in thousands, and Age is the estimated median age in years, of the Voronoi cell’s population. Straight away the resulting residual plot indicates some seriously influential outliers. These very high residuals are caused by cells with major bike station usage but little population, such as the Times Square neighborhood or Financial District. This effect can also be seen in areas where low population is compounded with high commercial activity and many parks, such as along the Hudson.
Repairing the data
Many of the population-adjusted usage values in the orange cells in Map 3 are more than three standard deviations above the mean. In general, it makes sense to remove areas with very low population, since our explanatory variables are all characteristics of a Voronoi cell’s residents. Virtually every extreme value for age, income, percent white, or bike usage was associated with a very low population value. This makes sense: with few residents, there are fewer observations over which random variation can average out, so the estimates for such cells are noisier.
Once the extreme values had been removed, the residual plot became more satisfactory, but still displayed severe heteroscedasticity (see plot 2). Based on this, the second adjustment is to calculate a new column, called Log_freq, which is equal to the logarithm of the number of rides at a particular station. The evidence in favor of this can be seen in the residual plot, which shows a highly characteristic logarithmic pattern.
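A minimal sketch of the two adjustments follows. It assumes a natural logarithm (the report does not state the base) and a hypothetical population cutoff `MIN_POP`; the rows and column values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical station table; real values come from the dissolved polygons.
cells = pd.DataFrame({
    "station id": [1, 2, 3, 4],
    "Total_pop": [5, 1200, 900, 2500],
    "freq": [90000, 15000, 8000, 30000],
})

MIN_POP = 100  # assumed cutoff, not the threshold used in the study

# Drop cells with too few residents to characterize, then log the response.
kept = cells[cells["Total_pop"] >= MIN_POP].copy()
kept["Log_freq"] = np.log(kept["freq"])
print(kept)
```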
Finding the best model
With these two modifications we can proceed to attempt finding the best model. As mentioned previously, the process is done through a modified stepwise procedure with the help of a Python for loop. The regression setups were simply placed into a list, and each iteration of the for loop ran a regression with one particular setup.
We choose the best model according to adjusted R². With an adjusted R² of 0.246, the winning model is:
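The loop structure can be sketched as follows. This version swaps in a plain NumPy least-squares fit in place of the ArcGIS OLS tool and runs on synthetic data, so the variable values, the `adjusted_r2` helper and the list of setups are all stand-ins for illustration.

```python
import numpy as np

def adjusted_r2(y, X):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic data standing in for the station table.
rng = np.random.default_rng(0)
n = 200
data = {
    "Population": rng.normal(size=n),
    "White": rng.normal(size=n),
    "Age": rng.normal(size=n),
}
log_freq = (0.5 * data["Population"] + 0.8 * data["White"]
            + rng.normal(scale=0.5, size=n))

# Each setup is one combination of explanatory variables to try.
setups = [
    ["Population"],
    ["Population", "White"],
    ["Population", "White", "Age"],
]

results = {}
for setup in setups:
    X = np.column_stack([data[name] for name in setup])
    results[tuple(setup)] = adjusted_r2(log_freq, X)
    print(setup, round(results[tuple(setup)], 3))

# Pick the setup with the highest adjusted R-squared.
best = max(results, key=results.get)
print("best:", best)
```

Comparing on adjusted R-squared rather than plain R-squared matters here, because plain R-squared never decreases when a variable is added.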
Explanation of variables
- Population – the total number of people living in the station’s Voronoi cell.
- White – the percentage of the Voronoi cell’s population that is white, as of the 2010 Census.
- Income – the estimated median income in the Voronoi cell.
- Age – the estimated median age of the Voronoi cell’s residents.
- White:Income – the interaction term between White and Income.
- White:Age – the interaction term between White and Age.
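Read together, the variables above suggest a model of the following general form, with Log_freq (the logarithm of station usage) as the response. This is a reconstruction from the variable list only, and the fitted coefficient values are not reproduced here:

```latex
\log(\text{freq}) = \beta_0
  + \beta_1\,\text{Population}
  + \beta_2\,\text{White}
  + \beta_3\,\text{Income}
  + \beta_4\,\text{Age}
  + \beta_5\,(\text{White} \times \text{Income})
  + \beta_6\,(\text{White} \times \text{Age})
  + \varepsilon
```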
Even though the best model had an adjusted R² of 0.246, meaning that only 24.6% of variance in the logarithm of station use was explained, we found several statistically significant indicators. According to Prof. Schuble, we are only 5.4% away from a Nobel Prize in Economics. Interestingly, income was not among the statistically significant predictors. The strongest predictor, meanwhile, was the percentage of the population that is white. This is true even when accounting for the facts that: whites have a higher income than non-whites; whites live in areas with a higher population; and that there is no obvious reason for race to be a factor at play.
Areas of high population tend to attract more events, businesses and other incentives for people to bike there. As expected, age is negatively correlated with bike usage, possibly due to the health effects discussed above. Looking at the interaction terms, white residents do not act similarly across all income levels. The negative coefficient on the White:Income term, which is highly significant, indicates that Voronoi cells which are wealthier and racially whiter actually experience a slight decrease in ridership.
The actual reasons why race is a factor in bikeshare usage are very difficult to understand. Self-segregation seems like a plausible explanation: people of a given race may prefer to stay amongst members of their own racial group. Of course, this theory is entirely speculative and the subject requires further study. Commute time is an important factor for which the data were not adequately precise, though the American Community Survey provides estimates at a census tract level.
We have shown that the racial demographics of a bike station’s Voronoi cell have a statistically significant impact on its ridership, which was the goal of the study.