My First Data Visualization Project: Assessing White Wine Quality


The term “quality” is regularly used by those who produce, discuss and consume wine. Equally, as a letter written by a consumer to the wine magazine Decanter shows, it may be used by the public. In fact, it is as important for a wine aficionado to know that a certain wine is top quality as to know that a certain vintage of a famous wine is clearly below average, especially if the wine is expensive, because in that case they may save a lot of money.

Quality, as an element of wine, has a long history. The Egyptians were apparently describing wine as ‘very good quality’ by the death of Tutankhamun in 1352 BC (Johnson, 1989), and the Romans subsequently specified the best regions and the greatest vintages (Johnson, 1989). The wine trade today adopts various mechanisms to grade wine, such as informal consumer magazine tastings and wine shows. The concept of wine quality is very important to the wine industry and to wine consumers. Therefore, the perception that one buys or experiences “quality” through one’s choice may have a significant influence on the consumer decision-making process.

We would like to explore the following questions with this project:

- Which property of wine has the highest density, and is that property what makes white wine quality better, compared with the factors that make consumers buy more white wine?

- What could improve the quality of wine with the given properties?

- Which property has the highest grade rate? Why?

- How do you analyze each property and its behaviour with respect to the overall quality of wine?

Data and Methods

The data source we chose is the Wine Quality dataset from the UCI Machine Learning Repository, contributed by Cortez et al. The UCI Machine Learning Repository is a collection of databases and domain theories used by the machine learning community. The archive was created in 1987 by David Aha and fellow graduate students at UC Irvine.

Quality has been an important issue within the field of consumer behaviour, although its nature and its relationship to other factors such as price and value are subject to debate. This study could give us insight into quality analysis of wine, help us understand consumer behaviour, and show which properties of wine could attract more buyers.

Dataset Construction

Data used to construct the dataset for this project include:

- Fixed Acidity (the acid that contributes to the conservation of wine)

- Volatile Acidity (the amount of acetic acid in wine, which at high levels leads to an unpleasant vinegar taste)

- Citric Acid (this could be found in small amounts and can add freshness to the wine)

- Residual sugar (the amount of sugar remaining at the end of the process of fermentation)

- Chlorides (the amount of salt in wine)

- Free Sulphur Dioxide (this prevents microbial growth and oxidation of the wine)

- Total Sulphur Dioxide (gives the aroma and also a tinge of flavor to the wine)

- Density (the density of the wine, which is close to that of water and depends mainly on the alcohol and sugar content)

- pH (how acidic or basic a wine is on the pH scale)

- Sulphates (an additive that acts as an antioxidant)

- Alcohol (the percentage of alcohol present in wine)

Data Analysis Methods

To explore the research questions, we used bar charts, line plots, boxplots, and histograms to find trends and correlations with the quality of white wine. Some of the data was straightforward and easy to understand, but we had to manipulate other parts before performing analysis and visualization. For example, there was a column with a sensory property (the quality score), where we took the range of values and grouped it into high, medium and low ranges.

Data Analysis and Visualization

This dataset was not in csv format, so we had to convert it to csv before using it for analysis. After conversion, the dataset has 4898 observations of 13 variables.
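As a rough illustration of this conversion step, the sketch below parses delimited text into numeric records and writes them back out as comma-separated values, using only Python's standard library. The semicolon delimiter, the column names, the two sample rows, and the helper names `load_wine_rows` and `write_csv` are all assumptions made for illustration; they are not taken from the project files.

```python
import csv
import io

# Hypothetical two-row sample standing in for the raw file (the real
# dataset has 4898 rows). Delimiter and column names are assumptions.
RAW_SAMPLE = '''"fixed acidity";"residual sugar";"density";"quality"
7.0;20.7;1.001;6
6.3;1.6;0.994;6
'''

def load_wine_rows(handle, delimiter=";"):
    """Parse delimited text into a list of dicts with float values."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [{name: float(value) for name, value in row.items()} for row in reader]

def write_csv(rows, handle):
    """Write the parsed rows out as comma-separated values."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

rows = load_wine_rows(io.StringIO(RAW_SAMPLE))
out = io.StringIO()
write_csv(rows, out)
print(len(rows), "observations of", len(rows[0]), "variables")
```

With a real file, the `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`, which is what the `csv` module expects.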

Distribution of Properties of Wine

The most straightforward analysis we could do was to check the distribution of all the wine properties, so we plotted the distribution of each one. We could see that chlorides and residual sugar are right-skewed (most values are bunched at the left, with a long tail to the right), so we need to fix the skew by taking the log of these values.
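A log transform is a common remedy when a variable has a long tail of large values. The sketch below measures skewness before and after taking the natural log, as one way to check whether the transform helped. The numbers are made up to resemble a long-tailed property such as residual sugar, and the `skewness` helper is a plain moment-based estimate written for this example, not the function used in the project.

```python
import math

def skewness(values):
    """Moment-based skewness; a positive result means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up values with a long right tail (not rows from the dataset).
residual_sugar = [1.2, 1.5, 1.6, 2.0, 2.3, 3.1, 4.8, 7.5, 12.0, 20.7]

# The natural log is only defined for strictly positive values.
log_sugar = [math.log(v) for v in residual_sugar]

print("skew before:", round(skewness(residual_sugar), 2))
print("skew after: ", round(skewness(log_sugar), 2))
```

A skewness closer to zero after the transform suggests a more symmetric distribution, which tends to make histograms and later comparisons easier to read.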

The highest counts occur for properties such as fixed acidity and volatile acidity. To examine this further, we ran density plots to check how these properties are associated with wine quality.

Density Plots for Quality analysis

This plot shows the density of each property with respect to wine quality. The highest wine quality score is 9, and it falls between 12 and 14 on the density scale.
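For readers unfamiliar with density plots, the curve is usually a kernel density estimate: each observation contributes a small bell curve, and the curves are averaged. The sketch below is a minimal version under stated assumptions: a Gaussian kernel, a common rule-of-thumb bandwidth, and made-up alcohol values for two hypothetical quality groups. It is not the plotting code used for this post.

```python
from statistics import NormalDist, stdev

def gaussian_kde(samples, bandwidth=None):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (1.06 * std * n^(-1/5)); an assumed default.
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    kernels = [NormalDist(mu=s, sigma=bandwidth) for s in samples]
    return lambda x: sum(k.pdf(x) for k in kernels) / n

# Made-up alcohol percentages for two hypothetical quality groups.
alcohol_by_group = {
    "low": [8.8, 9.0, 9.2, 9.4, 9.5, 9.8, 10.0, 10.4],
    "high": [11.0, 11.4, 12.0, 12.2, 12.5, 12.8, 13.0, 13.4],
}

step = 0.1
grid = [8 + step * i for i in range(71)]  # evaluation points from 8.0 to 15.0

curves = {}
for group, values in alcohol_by_group.items():
    density = gaussian_kde(values)
    curves[group] = [density(x) for x in grid]

# Location of the highest point of each curve.
peaks = {g: grid[c.index(max(c))] for g, c in curves.items()}
print(peaks)
```

Plotting each list in `curves` against `grid` gives one density curve per quality group, so the groups can be compared on the same axis.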

Grouping Wine Quality

We have grouped the wine quality into 3 levels: high, medium and low. As we can see from the plot below, there are significantly more medium quality wines than high or low quality ones. However, grouping in this way represents the levels unevenly and leaves too few data points in some groups for analysis.
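The grouping step can be sketched as a small function that maps each numeric score to a level and then counts the levels. The cut points below (4 and under is low, 7 and over is high) are hypothetical, since the post does not state the ones actually used, and the scores are made up.

```python
from collections import Counter

def quality_level(score, low_max=4, high_min=7):
    """Map a numeric quality score to 'low', 'medium', or 'high'.

    The default cut points are assumptions chosen for illustration.
    """
    if score <= low_max:
        return "low"
    if score >= high_min:
        return "high"
    return "medium"

scores = [3, 4, 5, 5, 6, 6, 6, 6, 7, 8, 9]  # made-up quality scores
counts = Counter(quality_level(s) for s in scores)
print(counts)
```

Counting the levels right after grouping is a quick way to see how unbalanced the groups are before comparing them.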

In conclusion, I found that the chemical property most closely related to density is fixed acidity. In this scatter plot we removed the medium quality wine points to make the difference between high and low quality wines easier to see. We can observe that high quality wines tend to have lower density for a given fixed acidity. Fixed acidity explains about 63% of the variance in density (R^2 = 0.6308) in the high quality wines and about 54% (R^2 = 0.5369) in the low quality wines.
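For reference, an R^2 value like the ones quoted above can be computed from a least-squares line fitted to one group at a time. The sketch below shows the calculation with made-up points; the `r_squared` helper and the data are illustrative assumptions and do not reproduce the 0.6308 or 0.5369 figures.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares around the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up (x, y) pairs standing in for (fixed acidity, density) in one group.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print(round(r_squared(x, y), 4))
```

R^2 is the share of the variance in y that the fitted line accounts for, so a value of 0.63 means the line leaves about 37% of the variance unexplained.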

In sum, the highest quality wines tend to have lower density for a given residual sugar. However, this research was conducted only on white wine and covers only a few facets of wine quality. There are many more questions we could explore to understand it better, and the more data we collect, the better we can analyze it.

The detailed analysis for this project can be found here.
