The Diamonds Dataset in Python


The diamonds dataset is one of the sample datasets available through seaborn, a Python data visualization library.

Below is some Python code you can use to work with this dataset. Its purpose is to give data professionals practice working with datasets. The data dictionary follows.

  • price – price in US dollars ($326–$18,823)
  • carat – weight of the diamond (0.2–5.01)
  • cut – quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color – diamond colour, from D (best) to J (worst)
  • clarity – a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x – length in mm (0–10.74)
  • y – width in mm (0–58.9)
  • z – depth in mm (0–31.8)

The dataset also includes two proportion columns, listed below. As a feature engineering exercise, you could recreate depth from x, y and z.

  • depth – total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • table – width of top of diamond relative to widest point (43–95)
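The depth formula can be sketched in pandas as follows. This is a minimal example, not the method used to build the dataset: the function name depth_percentage and the sample rows are hypothetical, the factor of 100 is an assumption based on the 43–79 range given above, and rows whose x and y are both 0 are turned into NaN rather than dividing by zero.

```python
import numpy as np
import pandas as pd


def depth_percentage(df: pd.DataFrame) -> pd.Series:
    """Return 100 * 2 * z / (x + y), with NaN where x + y is zero."""
    # Replace a zero denominator with NaN so the division does not produce inf.
    denom = (df["x"] + df["y"]).replace(0, np.nan)
    return (100 * 2 * df["z"] / denom).round(1)


# Illustrative rows only (hypothetical measurements, not taken from the dataset).
# The last row mimics a record whose dimensions were recorded as 0.
sample = pd.DataFrame({
    "x": [4.0, 6.0, 0.0],
    "y": [4.0, 6.2, 0.0],
    "z": [2.4, 3.8, 0.0],
})
sample["depth_calc"] = depth_percentage(sample)
print(sample)
```

With the real data, calling depth_percentage(df) and comparing the result with the existing depth column is a quick way to spot rows with suspicious measurements.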

Let’s get into it by writing some Python code in a Jupyter notebook.

# libraries 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the diamond dataset into a Pandas dataframe
df = sns.load_dataset('diamonds')

df.head(15).T    # .T transposes the frame, so each column is shown as a row


df.info()
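The output of df.info() shows the dtype of each column. The cut, color and clarity grades in the data dictionary have a natural worst-to-best order, and depending on the seaborn version they may load as plain strings or as categoricals that are not marked as ordered. The sketch below shows one way to make that order explicit. The helper order_grades and the toy rows are hypothetical; the orderings are taken from the data dictionary above.

```python
import pandas as pd

# Worst-to-best orderings, as described in the data dictionary.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def order_grades(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose grade columns are ordered categoricals."""
    out = df.copy()
    for col, order in [("cut", CUT_ORDER),
                       ("color", COLOR_ORDER),
                       ("clarity", CLARITY_ORDER)]:
        out[col] = pd.Categorical(out[col], categories=order, ordered=True)
    return out


# Hypothetical rows for illustration; with the real data, pass df instead.
toy = pd.DataFrame({
    "cut": ["Ideal", "Fair", "Premium"],
    "color": ["E", "J", "D"],
    "clarity": ["SI2", "IF", "VS1"],
})
graded = order_grades(toy)

# Sorting now follows grade quality instead of alphabetical order.
print(graded.sort_values("cut"))

# Ordered categoricals also support comparisons against a grade.
print(graded[graded["clarity"] >= "VS1"])
```

Once the grades are ordered, sorting, filtering and plot axes follow quality rather than the alphabet.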
