

# Compare presence of seat classes in largest clusters

Looking at the raw data, the difference I can ascertain between the two largest clusters is that all the aircraft in one have first class seating, whereas all the planes in the other have business class instead.

All in all, the techniques did a pretty admirable job of separating the different types of aircraft into distinct categories. However, I believe the way I structured the data may have biased it towards categorizing the aircraft by seating class, as that quality was replicated in the data set compared to other variables, being represented both in quantitative variables (seat pitch and width, number of seats in class) and in categorical ones (class presence). So really the different seating classes were represented in triplicate within the data set compared to other variables, which is why the methods separated the aircraft in this way.
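The seat-class comparison between the two largest clusters can be sketched roughly as follows. This is only an illustration: the data frame `toy`, its column names, and the cluster labels are all made up, and the cluster assignment is assumed to have been produced already by whichever clustering method was used.

```r
# Hypothetical stand-in for the fleet data: 0/1 flags for seat class presence
# plus a cluster label. All names and values are invented for illustration.
toy <- data.frame(
  first_class = c(1, 1, 1, 0, 0, 0, 0),
  business    = c(0, 0, 0, 1, 1, 1, 0),
  economy     = c(1, 1, 1, 1, 1, 1, 1),
  cluster     = c(1, 1, 1, 2, 2, 2, 3)
)

# Identify the two largest clusters by member count
sizes   <- sort(table(toy$cluster), decreasing = TRUE)
largest <- as.integer(names(sizes)[1:2])

# For each of those clusters, the share of aircraft having each seat class
in_largest  <- toy[toy$cluster %in% largest, ]
class_share <- aggregate(cbind(first_class, business, economy) ~ cluster,
                         data = in_largest, FUN = mean)
print(class_share)
```

A share of 1 in one cluster and 0 in the other for a given class is the kind of clean split described above.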

These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft's overall size. Remarkably, there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware. The exception here is the variable right in the middle, which is the number of engines. There is one lone outlier which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which more strongly discern the aircraft from one another than this.

How do we more easily visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.

Next, let's say I know nothing about dimensionality reduction techniques and just naively apply principal components to the data in R:

# Naively apply principal components analysis to raw data and plot

Taking that approach, we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principal component is just the range in miles.

summary(pc) # 1 component has > 99% variance
loadings(pc) # Can see all variance is in the range in miles

This is because the scale of the different variables in the data set varies widely. We can see this by plotting the variance of the different columns in the data frame (regular scaling on the left, logarithmic on the right):

barplot(sapply(data, var), horiz=TRUE, las=1, cex.names=0.8)
barplot(sapply(data, var), horiz=TRUE, las=1, cex.names=0.8, log='x')
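The effect described above, where one large-scale variable swamps an unscaled principal components analysis, can be reproduced on a small made-up data frame. This is a minimal sketch, not the original analysis: the column names and values are invented, and `princomp` is an assumption about which function was used, chosen because `summary()` and `loadings()` both work on its result.

```r
# Toy data (invented values, NOT the Delta data): one column on a much larger
# scale than the others, loosely mimicking range in miles.
set.seed(42)
toy <- data.frame(
  range_miles = rnorm(44, mean = 4000, sd = 2200),  # hypothetical column
  wingspan_ft = rnorm(44, mean = 120,  sd = 30),    # hypothetical column
  engines     = c(rep(2, 43), 4)                    # one lone four-engine outlier
)

# Per-column variances differ by orders of magnitude
col_var <- sapply(toy, var)
print(col_var)

# Naive PCA on the raw, unscaled columns
pc_toy <- princomp(toy)

# Proportion of total variance carried by each component
prop_var <- pc_toy$sdev^2 / sum(pc_toy$sdev^2)
print(round(prop_var, 4))

# Loadings of the first component: which variable dominates it?
first_loadings <- unclass(loadings(pc_toy))[, 1]
print(round(first_loadings, 3))
```

Because principal components follow the directions of largest variance, a column measured in thousands will dominate columns measured in single digits unless the variables are put on a comparable scale first.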


The point is that, as I said, I spent quite a bit of time travelling for work last year. Apparently the story with frequent flier miles is that it's best just to pick one airline and stick with it, and this also worked out well, as most companies, including my employer, have preferred airlines, so you often don't have much of a choice in the matter.

So I happened to notice on one of my many visits to Delta's website that they have data on all of their aircraft in a certain section of the site. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta's fleet, coming at it with 'fresh eyes'? Which planes are similar? Which are dissimilar?

The data set comprises 33 variables on 44 aircraft taken from Delta's website, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, as well as categorical data on, say, whether a particular aircraft has Wi-Fi or video.
