Convert Strings to Floats
This post covers one EDA structuring task in Python with pandas: converting a DataFrame column of strings to floats. In the example data, the sales figures are stored as strings in millions of dollars, such as '$1.286M'. You can copy and paste the code into a Jupyter Notebook, for example, and follow along.
import pandas as pd

data = {'company': ['ABC Inc.', 'XYZ Corp.', 'Acme Ltd', 'Widget LLC'],
        'sales': ['$1.286M', '$6.722M', '$3.320M', '$4.197M'],
        'industry': ['Technology', 'Foods', 'Foods', 'Technology'],
        'date_founded': ['2/25/2006', '5/17/2003', '3/7/2011', '11/2/2012']}
df = pd.DataFrame(data)
df
# Convert the sales column to a numeric type
# str.strip('$M') removes leading and trailing '$' and 'M' characters
df['sales'] = df['sales'].str.strip('$M').astype(float)
df
The sales column now holds floats, still expressed in millions of dollars.
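To confirm that the conversion worked, it can help to check the dtype of the result. The following is a minimal sketch that rebuilds only the sales values from the example. It also shows `pd.to_numeric` as one possible alternative to `astype(float)`. The variable names `sales`, `cleaned`, and `as_float` are chosen here for illustration.

```python
import pandas as pd

# Same sales strings as in the example DataFrame
sales = pd.Series(['$1.286M', '$6.722M', '$3.320M', '$4.197M'])

# Remove leading/trailing '$' and 'M' characters from each string
cleaned = sales.str.strip('$M')

# pd.to_numeric is an alternative to astype(float);
# errors='coerce' turns values that cannot be parsed into NaN
# instead of raising an error
as_float = pd.to_numeric(cleaned, errors='coerce')

print(as_float.dtype)  # float64
print(as_float)
```

Using `errors='coerce'` is a design choice: it keeps the conversion from failing on a stray bad value, at the cost of silently producing NaN, so a follow-up check with `as_float.isna().sum()` may be worthwhile.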
Convert Numeric to String
Here we use astype() again:
df['sales'] = df['sales'].astype(str)
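One thing to be aware of is that `astype(str)` uses Python's default float formatting, so trailing zeros are dropped. The sketch below assumes the sales column already holds floats and shows both the plain conversion and, as an optional extra, an f-string that rebuilds a label in the original '$…M' style. The column names `sales_str` and `sales_label` are made up for this example.

```python
import pandas as pd

# Assumed starting point: sales already converted to floats
df = pd.DataFrame({'sales': [1.286, 6.722, 3.320, 4.197]})

# Plain conversion: each float becomes its default string form
df['sales_str'] = df['sales'].astype(str)

# Optional: control the formatting explicitly (three decimals, '$' and 'M')
df['sales_label'] = df['sales'].map(lambda x: f'${x:.3f}M')

print(df)
```

Here 3.320 becomes '3.32' under `astype(str)` but '$3.320M' with the explicit format.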
Remove Commas
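Suppose we had commas in our column of numbers, for example as thousands separators. How would we get rid of those? str.strip(',') doesn’t work, because it removes characters only from the start and end of each string, not from the middle. You could use str.replace() instead, as in the following code for a DataFrame called df2.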
df2['sales'] = df2['sales'].str.replace(',','').astype(float)
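To see the difference between the two methods, here is a small self-contained sketch. The values in `df2` are hypothetical and only serve as a demonstration. Passing `regex=False` is an addition here, to make explicit that the comma is matched literally.

```python
import pandas as pd

# Hypothetical sales figures with thousands separators
df2 = pd.DataFrame({'sales': ['1,286,000', '6,722,000', '3,320,000']})

# str.strip only trims the ends, so the interior commas remain
stripped = df2['sales'].str.strip(',')
print(stripped.tolist())

# str.replace removes every comma; the result can then be cast to float
df2['sales'] = df2['sales'].str.replace(',', '', regex=False).astype(float)
print(df2['sales'].tolist())
```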
Convert an Object to a Datetime
Suppose your DataFrame is called df and your column is called my_date. Here’s how you could convert the my_date column from an object to a datetime.
df['my_date'] = pd.to_datetime(df['my_date'])
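You can also pass a format string that tells pandas how the dates are written. Here’s an example for dates followed by a 12-hour time with an AM/PM marker.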
# Convert `my_date` column to datetime format
df['my_date'] = pd.to_datetime(df['my_date'], format='%m/%d/%Y %I:%M:%S %p')
print('Data type of my_date:', df['my_date'].dtype)
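The same idea can be tried on the date_founded values from the example data at the top of the post. This is a sketch under the assumption that those strings are in month/day/year order, which the value '2/25/2006' suggests. The format '%m/%d/%Y' is chosen here to match them.

```python
import pandas as pd

# date_founded values from the example DataFrame
df = pd.DataFrame({'date_founded': ['2/25/2006', '5/17/2003', '3/7/2011', '11/2/2012']})

# Parse month/day/four-digit-year strings into datetimes
df['date_founded'] = pd.to_datetime(df['date_founded'], format='%m/%d/%Y')

print('Data type of date_founded:', df['date_founded'].dtype)

# Once converted, the .dt accessor exposes date parts such as the year
print(df['date_founded'].dt.year.tolist())
```

Giving an explicit format removes ambiguity for dates such as '3/7/2011', which could otherwise be read as either March 7 or July 3.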
Convert a Float to a Boolean (0 or 1)
Suppose you have a column of floating-point numbers and you wish to create another column of zeros and ones. The new value should be 1 if the floating-point number is at or above 20% (0.2), and 0 otherwise. Your DataFrame is called df and the column is called percent. Your new column is called generous_tipper.
df['generous_tipper'] = df['percent']  # create new column
df['generous_tipper'] = (df['generous_tipper'] >= 0.2)  # convert to True or False
df['generous_tipper'] = df['generous_tipper'].astype(int)  # convert to an integer
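The three steps can also be collapsed into a single line. The sketch below uses made-up percent values purely to illustrate the result. Note that the comparison is `>=`, so a value of exactly 0.2 counts as a generous tipper.

```python
import pandas as pd

# Hypothetical tip percentages for illustration
df = pd.DataFrame({'percent': [0.15, 0.20, 0.25, 0.18]})

# Compare against the threshold (True/False), then cast to 1/0
df['generous_tipper'] = (df['percent'] >= 0.2).astype(int)

print(df)
```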