-
Notifications
You must be signed in to change notification settings - Fork 2
Pandas coding puzzles
Q1: You've got 2 tables called Table T1, and Table T2. Please perform the following joins and tell me the number of records in each of these joins. Please convert each of these tables in dataframe first and perform following joins.
Left Join, Right Join, Inner Join.
T1
Col_1
1
2
3
4
4
NaN
T2
Col_1
1
1
1
2
3
3
4
5
NaN
NaN
Ans:
Certainly! To perform these joins, we can use Python with the pandas library to create dataframes and then perform the left join, right join, and inner join. Here's how you can do it:
```python
import pandas as pd
# Creating the dataframes
data1 = {'Col_1': [1, 2, 3, 4, 4, None]}
data2 = {'Col_1': [1, 1, 1, 2, 3, 3, 4, 5, None, None]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Performing Left Join
left_join_result = pd.merge(df1, df2, on='Col_1', how='left')
# Performing Right Join
right_join_result = pd.merge(df1, df2, on='Col_1', how='right')
# Performing Inner Join
inner_join_result = pd.merge(df1, df2, on='Col_1', how='inner')
# Displaying the results
print("Left Join Result:")
print(left_join_result)
print("\nNumber of records in Left Join:", len(left_join_result))
print("\nRight Join Result:")
print(right_join_result)
print("\nNumber of records in Right Join:", len(right_join_result))
print("\nInner Join Result:")
print(inner_join_result)
print("\nNumber of records in Inner Join:", len(inner_join_result))
Note: The None
values in the data have been represented as NaN
(Not a Number) in the resulting dataframes.
Left Join Result:
Col_1
0 1.0
1 1.0
2 1.0
3 2.0
4 3.0
5 3.0
6 4.0
7 4.0
8 NaN
9 NaN
Number of records in Left Join: 10
Right Join Result:
Col_1
0 1.0
1 1.0
2 1.0
3 2.0
4 3.0
5 3.0
6 4.0
7 4.0
8 5.0
9 NaN
10 NaN
Number of records in Right Join: 11
Inner Join Result:
Col_1
0 1.0
1 1.0
2 1.0
3 2.0
4 3.0
5 3.0
6 4.0
7 4.0
8 NaN
9 NaN
Number of records in Inner Join: 10
import pandas as pd
df = pd.DataFrame({'row':['00000 UNITED STATES',
'010001 ALABAMA','01001 AUTUAGA COUNTY, AL',
'01002 Baldwin County,AL','01003 Barbour County, AL']})"
Ans:
Here are five different methods to separate numerical values from city names using the given DataFrame:
import pandas as pd
df = pd.DataFrame({'col': ['00000 UNITED STATES',
'010001 ALABAMA',
'01001 AUTUAGA COUNTY, AL',
'01002 Baldwin County,AL',
'01003 Barbour County, AL']})
df['Numerical_Values'] = df['col'].str.extract('(\d+)')
df['City_Names'] = df['col'].str.extract('([a-zA-Z\s]+)')
print(df[['Numerical_Values', 'City_Names']])
import pandas as pd
df = pd.DataFrame({'col': ['00000 UNITED STATES',
'010001 ALABAMA',
'01001 AUTUAGA COUNTY, AL',
'01002 Baldwin County,AL',
'01003 Barbour County, AL']})
df[['Numerical_Values', 'City_Names']] = df['col'].str.split(n=1, expand=True)
print(df[['Numerical_Values', 'City_Names']])
import pandas as pd
df = pd.DataFrame({'col': ['00000 UNITED STATES',
'010001 ALABAMA',
'01001 AUTUAGA COUNTY, AL',
'01002 Baldwin County,AL',
'01003 Barbour County, AL']})
df['Numerical_Values'] = df['col'].str.replace('[^0-9]', '').str.strip()
df['City_Names'] = df['col'].str.replace('[0-9]', '').str.strip()
print(df[['Numerical_Values', 'City_Names']])
import pandas as pd
df = pd.DataFrame({'col': ['00000 UNITED STATES',
'010001 ALABAMA',
'01001 AUTUAGA COUNTY, AL',
'01002 Baldwin County,AL',
'01003 Barbour County, AL']})
def extract_numerical_and_city(row):
numerical_values = ''.join(c for c in col if c.isdigit())
city_names = ''.join(c for c in col if not c.isdigit()).strip()
return pd.Series([numerical_values, city_names])
df[['Numerical_Values', 'City_Names']] = df['col'].apply(extract_numerical_and_city)
print(df[['Numerical_Values', 'City_Names']])
import pandas as pd
df = pd.DataFrame({'col': ['00000 UNITED STATES',
'010001 ALABAMA',
'01001 AUTUAGA COUNTY, AL',
'01002 Baldwin County,AL',
'01003 Barbour County, AL']})
df['Numerical_Values'] = df['col'].str.findall(r'\d+').apply(lambda x: ''.join(x))
df['City_Names'] = df['col'].str.findall(r'[a-zA-Z\s]+').apply(lambda x: ''.join(x).strip())
print(df[['Numerical_Values', 'City_Names']])
-
[a-zA-Z]
: This is a character class that matches any single alphabetical character. The rangea-z
covers lowercase letters, and the rangeA-Z
covers uppercase letters. Combined,[a-zA-Z]
matches any alphabetical character, regardless of case. -
\s
: This is a shorthand character class representing any whitespace character, such as space, tab, or newline. -
+
: The plus sign indicates that the preceding character class ([a-zA-Z\s]
) should match one or more times consecutively. This allows the regex to capture multiple alphabetical characters along with whitespace characters in sequence.
Putting it all together, ([a-zA-Z\s]+)
captures one or more sequences of alphabetical characters (both uppercase and lowercase) and whitespace characters as a single group. This regex is often used to extract words or names from a text containing only letters and spaces.
Each method uses a different approach, so you can choose the one that best fits your preferences or requirements.
import pandas as pd
# Create a sample DataFrame
data = {'Full_Name': ['John Doe', 'Jane Smith', 'Bob Johnson']}
df = pd.DataFrame(data)
# Split the 'Full_Name' column into 'First_Name' and 'Last_Name'
df[['First_Name', 'Last_Name']] = df['Full_Name'].str.split(expand=True)
# Drop the original 'Full_Name' column if needed
df.drop('Full_Name', axis=1, inplace=True)
# Display the updated DataFrame
print(df)