Question

1 Approved Answer

Posted on May 24, 2024


Factors encouraging cycle commuting in Scotland
Big Data Fundamentals Coursework

Daniel Devine
CS982 Big Data Technologies
Computer and Information Sciences
University of Strathclyde, Glasgow
5th November 2018

Contents

List of Figures
List of Tables
1 Introduction
2 Dataset - 2011 Census Records
  2.1 Aims
  2.2 Source
  2.3 Analysis
    2.3.1 Method of Travel
    2.3.2 Accommodation Type
    2.3.3 General Health
    2.3.4 Social Grade
    2.3.5 Qualifications
    2.3.6 Summary
3 Unsupervised Analysis - Clustering
  3.1 Agglomerative Clustering
  3.2 K-Means
4 Supervised Approach
  4.1 Logistic Regression
  4.2 Linear Regression
  4.3 Discussion
5 Reflections
6 Conclusion
A Environment
Bibliography

List of Figures

2.1 Methods of Transport Distribution
2.2 Methods of Transport Heatmap
2.3 Accommodation Type Distribution
2.4 Accommodation Type Heatmap
2.5 General Health Distribution
2.6 General Health Heatmap
2.7 Social Grade Distribution
2.8 Social Grade Heatmap
2.9 Education Level Distribution
2.10 Education Level Correlation
2.11 Heatmap of All Variables
3.1 Comparison of Agglomerative Methods
3.2 K-Means Clustering for Different Numbers of Clusters

List of Tables

2.1 Census Tables Used
2.2 Social Grades and Descriptions
2.3 Education Levels and Descriptions
4.1 Performance of Logistic Regression Method
4.2 Performance of Linear Regression Method
4.3 Coefficients of Linear Regression Method

Chapter 1 Introduction

Cycle commuting is alien to the majority of Scottish citizens, but why is this? It's widely publicised and common knowledge that cycling is a healthy, low-cost and environmentally sustainable method of transport. The statistics make for concerning reading. About 2.8% of people aged over 16 cycled on at least 1-2 days as a means of transport in the previous week when asked in 2016; 2.1% cycled on 3-5 days, and 1% on 6-7 days (Cycling UK 2018). This is at odds with much of the rest of Europe. The only EU countries with fewer individuals who ride daily during the week are Cyprus (2%) and Malta (1%) (European Commission 2013).
Although lack of infrastructure is a major barrier to cycling, there is a plethora of other factors that may or may not be conducive to a high uptake of cycling. This report examines 2011 census data in an effort to identify relations between assorted population data and current cycling habits. Extracting these more subtle indicators or motivators may allow for smarter prioritisation of infrastructure or education towards areas which will react more favourably and, optimistically, lead to a trickle-down of attitudes.

Chapter 2 Dataset - 2011 Census Records

2.1 Aims

2.2 Source

All data used are freely available on the Scotland's Census website. A number of tables have been combined to create the final dataset. All acquired tables hold 2011 data and are in percentage, comma-separated value form. See Table 2.1 for a list of the data used. Rows are common across all tables, with data provided for each of the 6,976 SNS (Scottish Neighbourhood Statistics) zones. SNS zones are geographical divisions holding between 500 and 1,000 individuals; zones obey local authority and municipal boundaries. The zone to which any data belongs is preserved throughout to allow for case-by-case analysis when necessary, such as explaining outliers. All data are stored in a two-dimensional Pandas DataFrame of dimension 30x6,976, representing 6,976 elements each containing one index parameter (the SNS zone) and 29 percentage variables.

Table 2.1: Census Tables Used

Table     Title                                              Columns
QS702SC   Method of travel to work or study                  12
QS402SC   Accommodation type - Households                    5
QS302SC   General health                                     6
QS613SC   Approximated social grade - People aged 16 to 64   5
QS501SC   Highest level of qualification                     6

2.3 Analysis

Preliminary analysis of each table is performed and described below to act as a reference for the more advanced techniques used later.
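To make the structure of the combined dataset described in Section 2.2 more concrete, the sketch below joins two tiny stand-in tables on their zone index using pandas. The zone names, column names and values are invented for illustration, and the real census files may well be laid out differently.

```python
import io

import pandas as pd

# Two tiny stand-ins for census tables; zones and values are made up.
travel_csv = io.StringIO(
    "zone,Drive,Bicycle\n"
    "Zone A,41.0,1.2\n"
    "Zone B,55.5,0.0\n"
)
health_csv = io.StringIO(
    "zone,Very Good Health,Good Health\n"
    "Zone A,52.0,30.1\n"
    "Zone B,48.3,33.0\n"
)

# Index every table by zone so that rows line up across tables.
tables = [pd.read_csv(f, index_col="zone") for f in (travel_csv, health_csv)]

# Column-wise concatenation aligns the tables on the shared zone index.
dataset = pd.concat(tables, axis=1)
print(dataset.shape)
```

Keeping the zone as the index, rather than as an ordinary column, means any row can later be traced back to its datazone when investigating outliers.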
2.3.1 Method of Travel

The Method of Travel table contains data collected on all persons over the age of four who are in full-time education, and all those between the ages of 16 and 74 who are in employment or full-time education. Respondents were only able to select their primary method of transport, so if their journey comprised, say, a bus journey to a train station and then a train to their workplace, it was at their discretion which mode to select. Ideally the census form would have used checkboxes in place of a radio-type selection.

Figure 2.1 shows the spread of how people in different zones commute. By far the most common method of transport is the car. The mean percentage of those driving is 41.3%; if car passengers are included this rises to 50.4%. More people commute by car than by all other modes combined. Intriguingly, the range in the data for people driving is second only to that for those who walk. The area with the most driving is in South Aberdeenshire, the least being Glasgow City Centre East. This seems intuitive: those living in central Glasgow likely live within walking distance of work and have a plethora of public transport at their disposal. Aberdeenshire contains some of the most rural areas in the UK and is generally poorly served by public transport.

Figure 2.1: Methods of Transport Distribution

Cycling has the fourth lowest number of commuters of the modes surveyed, with a mean of 1.23%. Only motorcycling (0.22%), underground (0.25%) and taxi (0.72%) were less popular. The areas with the most cyclists were Edinburgh Marchmont West 1 (13.53%), Edinburgh Marchmont West 2 (11.74%), and Edinburgh Marchmont East and Sciennes - 04 (11.66%). The Marchmont area of Edinburgh is adjacent to the Edinburgh Meadows, which has extensive off- and on-road cycle paths. A large number of areas, 722 of 6,976, had zero respondents commute by bicycle.
One might expect these to be the most remote areas; however, that does not appear to be the case, with large swathes of Lanarkshire, one of the most urbanised local authorities, included in this list.

There are some correlations to note; see Figure 2.2. Areas where many people drive also have many commuting as a car passenger. Areas where walking is popular also have higher bike commuting. There is a strong negative correlation between car areas and areas where many commute by bus. The same negative correlation exists between car areas and walk areas. The predominant correlation is negative across the whole heatmap; this is a result of all variables being normalised and summing to 100. Where one mode is particularly popular, the others must necessarily be low.

Examining purely the correlations related to cycle commuting, positive correlations are apparent with underground, bus, motorcycle and walking. Negative correlations exist between cycling and train, taxi and car. These two groups match up with the average distance one might travel by each of these modes. This is the urban/rural divide.

Figure 2.2: Methods of Transport Heatmap

2.3.2 Accommodation Type

The Accommodation Type table contains data on all abodes in Scotland; the proportion of each type is given for all datazones. Across the whole country, the mean percentage for all types is quite consistent, being around 20%. Areas exist where almost all abodes are of one type. More than 25% of datazones have zero of one or more abode types, since the lower whisker is on the zero line for all types. See Figure 2.3.

Accommodation type is a good indicator of how dense an area is, i.e. the level of urbanisation. Areas predominantly occupied by apartments tend to have very few detached or semi-detached dwellings. See Figure 2.4.

Figure 2.3: Accommodation Type Distribution

Figure 2.4: Accommodation Type Heatmap

2.3.3 General Health

General health data has been collected for the whole population; the question asks how people would describe their health qualitatively. This indicator will be used to find links between the healthiness of an area and the methods of transport chosen. Figure 2.5 shows the distribution of responses. Most areas follow the same distribution, where the largest proportion are in very good health, with decreasing numbers reporting ever poorer health.

Figure 2.5: General Health Distribution

Looking at interrelations, areas where fair or worse health is prominent tend to also have a disproportionate number of people reporting poor health. See Figure 2.6. On the face of it, there appear to be three clumps: areas with very good health, areas with good and fair health, and areas with fair and worse health.

Figure 2.6: General Health Heatmap

2.3.4 Social Grade

The Social Grade table contains data on all individuals aged 16 to 64. For each datazone, the percentage of applicable residents in each approximated social grade is given. The grades and their meanings are explained in Table 2.2. Note that this statistic is derived from multiple other responses; individuals did not self-identify.

Table 2.2: Social Grades and Descriptions

Grade   Description
AB      Intermediate and Higher Managerial/Administrative/Professional
C1      Supervisory and Junior Managerial/Administrative/Professional
C2      Skilled Manual Workers
DE      Semi and Unskilled Manual Workers, Unemployed, On State Benefit

The spread of social grades across areas is large. Areas exist that are almost wholly occupied by C1 individuals; see Figure 2.7. The heatmap for this table shows significant division. There is a clear correlation between AB and C1, and also between C2 and DE.
There is a very strong inverse correlation between AB and DE; few (if any) areas contain a large proportion of both. The reason for this is likely down to the drawing of boundaries: as far as possible, streets and estates are kept within a single area.

Figure 2.7: Social Grade Distribution

Figure 2.8: Social Grade Heatmap

2.3.5 Qualifications

The final census table to be analysed is Highest Level of Qualification. Data is present for all individuals over age 16. The data is classified in levels; these are explained in Table 2.3.

Table 2.3: Education Levels and Descriptions

Level   Description
1       Scottish Standard Grade / GCSE
2       Scottish Higher / A-Level
3       College Qualification
4       Degree or Higher

The proportions of education level vary widely across datazones, with a large number of outliers in Figure 2.9. Intriguingly, the proportion of individuals educated to college level is the lowest, and by some margin. No datazones exist where greater than 30% of the population is trained to college level (and no higher).

Figure 2.9: Education Level Distribution

There are three pairings where a positive correlation is observed. Areas where there are a large number of individuals without formal education also house a large number of individuals not educated beyond mid-high school. There is also a strong correlation between those educated to Higher/A-Level and Degree individuals. This may be due to the historic association of college qualifications and more vocational jobs. A weaker, but also positive, correlation exists between Higher/A-Level educated individuals and College educated individuals. See Figure 2.10.

Figure 2.10: Education Level Correlation

2.3.6 Summary

Looking at all of the tables together (Figure 2.11), some basic observations can be made about which factors correlate with cycling.
Health: Areas with predominantly very good health have a strong positive correlation, good health is neutral, and anything worse is negative.

Commuting Mode: Walking-heavy areas have a very strong positive correlation; bus, motorcycle and underground are also positive. Travelling by car has a negative correlation.

Housing Type: Areas with high proportions of apartments have a strong positive correlation; all other housing stock is fairly neutral.

Social Grade: Areas where people are mostly classified as AB or C1 are much more likely to cycle. Interestingly, C2 is less likely than DE.

Education Level: Areas with many university and high school educated individuals have a positive correlation; others are neutral.

Figure 2.11: Heatmap of All Variables

Chapter 3 Unsupervised Analysis - Clustering

The objective here is to find groupings of datazones; ideally each group will contain zones with similar levels of cycling. The dataset is not particularly well suited to unsupervised analysis, as we know there are already strong groupings within some of the tables. Further to this, the data is on a continuous scale. To mitigate this, the percentage cycling has been rounded to the nearest integer. 13 different values now exist, which will be treated as discrete/logical values. Before rounding there were 3,205 distinct values which, with only 6,976 pieces of data, is wholly unsuitable.

Two approaches are examined: hierarchical clustering and K-Means clustering.

3.1 Agglomerative Clustering

Agglomerative clustering is a method of hierarchical clustering where data begins in an ungrouped state. Groups are formed by gradually associating datapoints with other similar data. The closeness of data is determined by calculating the distance between points (the space in which the points exist has n dimensions, where n is the number of attributes associated with a datapoint). A datapoint is a single SNS datazone in this case.
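The merging process can be sketched in a few lines of plain Python. This is a naive illustration rather than the Scikit-Learn implementation used for the results in this chapter; the function name agglomerate, the sample points and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def agglomerate(points, n_clusters, linkage=min):
    """Naive bottom-up clustering; returns clusters as lists of point indices.

    linkage=min compares clusters by their closest pair of points,
    linkage=max by their farthest pair.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Reduce all cross-cluster point distances to a single value.
            d = linkage(
                euclidean(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair of clusters
        del clusters[b]             # safe because a < b
    return clusters


# Hypothetical two-attribute datapoints: three near the origin, two far away.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (0.5, 0.5)]
result = agglomerate(points, n_clusters=2)
print(sorted(sorted(c) for c in result))  # [[0, 1, 4], [2, 3]]
```

The sketch recomputes every pairwise distance on each pass, which is far too slow for thousands of datazones but keeps the idea visible.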
Because the problem isn't well suited to clustering, all linkage and affinity strategies within Scikit-Learn (Pedregosa et al. 2011), the Python package used, are examined so as not to miss any unexpectedly positive results. The method used to minimise the distance between data is known as linkage; Scikit-Learn has four options (Scikit Learn 2018):

Ward: Minimise the variance by minimising the sum of squared distances between items within clusters.

Complete: Minimise the greatest distance between points in pairs of clusters.

Average: Minimise the mean distance between all points within pairs of clusters.

Single: Minimise the distance between the closest points of pairs of clusters.

Ward linkage generally finds clusters of roughly equal size; since the pervasive value of cycle commuters is zero, it's unlikely this will fit the data well. Complete and average work very similarly; because the data has many peaks (outliers), average might be the better choice. Single is unlikely to perform well: it requires very few calculations in comparison to the others and as such is less resilient to noisy data. From analysis of the dataset, there appears not to be an obvious recipe, so this might be the least consistent method.

The distance measurement need not only be Euclidean (as the crow flies): Manhattan distance sums the distance that must be travelled in each dimension, and Cosine calculates the angular distance between points (measured from the origin).

Figure 3.1 shows the resulting scores of the different distance measurements for each of the clustering strategies. Interestingly, the distance measurement appears to have greater influence on the results than the algorithm. The metrics used are:

Silhouette Score: A measure of how similar an item is to other items within the same cluster compared to items in other clusters. Range -1 to 1, where 1 represents complete clustering.
Completeness Score: Proportion of items of a single type within a single cluster; clustering results are compared to the desired outcome for a test set. Range 0 to 1, where 1 represents all items being in the correct cluster.

Homogeneity Score: Proportion of items within a cluster that are of the same type. Range 0 to 1, where 1 represents all items within the cluster being of the same type.

All methods perform poorly; no method places more than 10% of the data within the correct cluster. As one might have expected, the average linkage performed the best, but it still is not usable.

Figure 3.1: Comparison of Agglomerative Methods

3.2 K-Means

K-Means is another clustering method; this time the number of clusters is fixed. The process is more straightforward than Agglomerative Hierarchical Clustering: n points are selected at random to be the centres of n clusters. Each point is then assigned to the closest centre, and each cluster's centre point is recalculated. This is repeated until the assignments stop changing.

Clustering is performed for 2 to 50 clusters; the performance of each number of clusters is shown in Figure 3.2. If the approach were successful we would see 12 clusters scoring best, as this is the number of unique scores as per the target (percentage cycle commuting). K-Means performs best where there are four groups. This in itself needn't be a bad sign: four groups may contain zero, a low amount, a moderate and a high number of cyclists. However, as the performance metrics are poor, this isn't the case.

Figure 3.2: K-Means Clustering for Different Numbers of Clusters

Chapter 4 Supervised Approach

Supervised methods involve training a system to classify data by showing it examples of similar data with the correct classification. For our case, data means all quantities other than percentage cycling, and the classification is the percentage cycling.
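As a rough illustration of that split between data and classification, the sketch below separates some hypothetical zone records into a feature matrix and integer labels. The records, the column names and the use of Python's built-in round are assumptions for the example; the report does not state exactly how the rounding was implemented.

```python
# Hypothetical records: one dict per zone, values are percentages.
records = [
    {"zone": "Zone A", "Bicycle": 1.2, "Drive": 41.3, "Walk": 12.0},
    {"zone": "Zone B", "Bicycle": 0.0, "Drive": 60.2, "Walk": 5.5},
    {"zone": "Zone C", "Bicycle": 3.6, "Drive": 20.1, "Walk": 30.4},
]

TARGET = "Bicycle"

# Every column except the zone index and the target is an input feature.
feature_names = [k for k in records[0] if k not in ("zone", TARGET)]

X = [[r[name] for name in feature_names] for r in records]

# Integer class labels; note that round() sends exact .5 ties to the even integer.
y = [round(r[TARGET]) for r in records]

print(feature_names, y)  # ['Drive', 'Walk'] [1, 0, 4]
```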
The purpose of this analysis is to find what parameters influence the amount of cycling, and how. Regression methods are not black boxes; they are a series of coefficients that can be analysed. As such, regression methods form the bulk of the analysis here. The proportion of data reserved for testing is 30% of the total.

4.1 Logistic Regression

Logistic regression works on the principle of how likely a binary condition is to be satisfied based on some set of inputs. It's still applicable here even though there is not a binary condition. The problem can be set up as a series of binary conditions such as "Is it zero?" or "Is it nine?". The condition with the greatest probability is the output.

Table 4.1 shows the performance of logistic regression. The total scores appear quite reasonable, placing three quarters in the right category (percentage). Closer examination shows that the system did not once classify correctly for 4, 5, 6, 8, 9, 11 and 13%. The dataset is highly biased; zero and other low percentage labels occupy almost the whole of the dataset.

Precision: Proportion of predicted positives that are true positives. Of all the items that the system assigned to a class, how many really belonged to it?

Recall: Proportion of all actual positives that were identified as positive.

F1: The harmonic mean of precision and recall.

Table 4.1: Performance of Logistic Regression Method

Percentage   Precision   Recall   F1     Occurrences
0            0.89        1.00     0.94   1232
1            0.51        0.68     0.58   475
2            0.11        0.01     0.03   204
3            0.18        0.07     0.10   82
4            0.00        0.00     0.00   44
5            0.00        0.00     0.00   23
6            0.00        0.00     0.00   13
7            1.00        0.18     0.31   11
8            0.00        0.00     0.00   6
9            0.00        0.00     0.00   1
11           0.00        0.00     0.00   2
13           0.00        0.00     0.00   0
total        0.66        0.75     0.69   2093

4.2 Linear Regression

Linear regression is similar; however, it is mathematically simpler as it works on a continuous scale. Each attribute has a linear coefficient; the sum of all of these attributes multiplied by their coefficients gives the output. It's important not to over-fit the coefficients.
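The weighted sum just described can be written out directly. The sketch below is only an illustration: the function name predict, the intercept term and all of the numbers are assumptions, not the fitted model from this report.

```python
def predict(features, coefficients, intercept=0.0):
    """Linear model output: intercept plus the sum of coefficient * attribute."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Illustrative numbers only; these are not the coefficients fitted in this report.
coefficients = [0.05, -0.02]  # e.g. weights for walking % and driving %
zone = [20.0, 40.0]           # attribute values for one hypothetical zone

predicted_cycling = predict(zone, coefficients, intercept=1.0)
print(round(predicted_cycling, 2))
```

Because the output is a plain weighted sum, each coefficient can be read as the change in predicted cycling percentage for a one-point change in that attribute, which is what makes the method open to inspection.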
Defining and assessing success is less straightforward for linear regression than for logistic regression. Predictions are on a continuous scale, so without manipulation of the output (which could be argued skews results) a different metric is required. The three used are:

Mean Absolute Error: The average absolute amount, in percentage points, by which the prediction is incorrect.

Mean Squared Error: The mean of the squares of the amounts by which the prediction is incorrect.

R2 Score: Also known as the coefficient of determination, this is 1 minus the ratio of the sum of squared errors of the model to the sum of squared errors of a baseline model (no regression, always predicting the mean). A result of 1 indicates the model is perfect; zero means the result is identical to the baseline.

The results of the linear regression fit are shown in Table 4.2. A comparison is made between running the algorithm once, running the algorithm ten times using the KFold technique, and a trivial control run. The trivial algorithm predicts values randomly, obeying the (very uneven in this case) distribution. Linear regression performs much better than simply guessing.

Table 4.2: Performance of Linear Regression Method

                      Linear Regression   Cross Verified   Trivial
Mean Absolute Error   0.24502             0.24082          1.17344
Mean Squared Error    0.08179             0.08033          3.9083
R2                    0.95837             0.95854          -0.9894

4.3 Discussion

In the current form of the problem, logistic regression will always perform poorly. It would likely perform much better if the spread of classes were more even. This could be forced by removing many of the zero datapoints, or by creating many more instances of the less represented classes. Doing this must be attempted carefully so as not to introduce some other error or skew in the data. A KFold-type arrangement, where the act of removing and fitting is repeated many times for different packets of random surplus data, would likely deliver the most representative result.
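One way the removal of surplus datapoints might look is sketched below. This was not carried out for the report; the function name undersample, the cap parameter and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter


def undersample(labels, cap, seed=0):
    """Return sorted indices keeping at most `cap` randomly chosen items per label."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    kept = []
    for indices in by_label.values():
        if len(indices) > cap:
            # Discard a random surplus from over-represented labels.
            indices = rng.sample(indices, cap)
        kept.extend(indices)
    return sorted(kept)


# Toy labels dominated by zeros, loosely mimicking the imbalance in Table 4.1.
labels = [0] * 12 + [1] * 5 + [2] * 2
kept = undersample(labels, cap=5, seed=1)
print(Counter(labels[i] for i in kept))
```

Calling the function with a different seed on each repeat discards a different random packet of surplus data, so fitting once per seed and averaging the scores would approximate the repeated arrangement suggested above.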
Duplicating data will artificially strengthen potentially fluke relationships.

The linear regression technique worked much more favourably. Table 4.3 displays the coefficients and the quantities they relate to. There is big variation in the order of magnitude of each of the indicators; this is somewhat down to the scaling of the parameters. Scaling was not explicitly performed, as the parameters are all of the same order of magnitude and in percentage form.

On inspection of the leading two significant figures of the coefficients, it seems that different levels of health have little influence on cycling to work; the same can be said for social grade and education level. Interestingly, a greater proportion of residents in terraced housing results in more cycling. Somewhat bizarrely, the other two-wheeled mode, motorcycling, was least conducive. These results should not be trusted as absolutes; the strange scaling is reason for additional questioning.

Table 4.3: Coefficients of Linear Regression Method

Attribute (% of Population)   Coefficient (2 s.f.)
Very Good Health              -54,000
Good Health                   -54,000
Fair Health                   -54,000
Bad Health                    -54,000
Very Bad Health               -54,000
No Commute                    -0.96
Underground                   -0.97
Train                         -0.96
Bus                           -0.96
Taxi                          -0.94
Drive                         -0.96
Car Passenger                 -0.96
Motorcycle                    -0.99
Walk                          -0.96
Other                         -0.97
Detached                      0.00074
Semi Detached                 0.00034
Terraced                      0.0011
Apartment                     0.00081
AB                            9,200
C1                            9,200
C2                            9,200
DE                            9,200
No Education                  -5,000
Basic School                  -5,000
Advanced School               -5,000
College                       -5,000
University                    -5,000

Using a Decision Tree yields marginally better results for non-zero outcomes; however, the resulting tree is extremely large and impossible to interpret manually, so it is of no value here. Other methods are more challenging to reverse engineer, so are also of no value in this case.
Chapter 5 Reflections

I perhaps wouldn't have selected these data if restarting the assignment, or maybe would have considered a question that was better answerable using the techniques learned in class. Perhaps it would have been better to stick to a single table and go into more depth. There were few datapoints for high percentages of cyclists, and the high percentages for cycling were relatively low in magnitude compared with the other variables.

I did try grouping all of the zeros, ones, and greater than ones together so there were only three groups. This helped marginally, although the data was still unbalanced, but the loss of resolution meant it was no longer useful. The aim was to find out why some places had a greater proportion cycling; if you can no longer tell what these areas are then it's no use.

The data did not suit clustering much; perhaps there are too many variables. Because I don't know what variables have what effect, I included lots in the hope that the ones which are significant would stand out. This did not happen. The signal to noise ratio probably was not at all favourable.

Chapter 6 Conclusion

The analysis on the whole has been inconclusive as to the main human factors promoting cycle commuting. There may even be no significant link between the selected data. Basic observation of correlation results in quite sensible outputs. It makes sense that areas with healthy, educated individuals living in apartments are more likely to cycle than unhealthy, more time-poor individuals living in detached houses that are likely far from the workplace.

Unsupervised methods do not fit the data well; the connections are too weak and there is too much noise. Supervised regression methods did a better job than clustering, but it's debatable whether the output is valid. The coefficients from the best performing method, linear regression, are counter-intuitive. Why would areas with apartments be least likely to cycle?
This requires more work, perhaps a restructuring of the input data. Overall, no secret recipe has been found.

Appendix A Environment

Language: Python 3.6
IDE: Spyder 3.2.3

Bibliography

Cycling UK (2018). Transport Statistics: Increase in cycle traffic welcomed by Sustrans Scotland. url: https://www.cyclinguk.org/resources/cycling-uk-cyclingstatistics#How%5C%20many%5C%20people%5C%20cycle%5C%20and%5C%20how%5C%20often? (visited on 25/10/2018).

European Commission (2013). Attitudes of Europeans Towards Urban Mobility. In: Special Eurobarometer 3, p. 10. url: http://ec.europa.eu/commfrontoffice/publicopinion/archives/ebs/ebs_406_en.pdf.

Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. In: Journal of Machine Learning Research 12, pp. 2825-2830.

Scikit Learn (2018). Clustering. url: http://scikit-learn.org/stable/modules/clustering.html (visited on 25/10/2018).
