Data Cleaning and Exploratory Data Analysis (EDA) on a Selected Dataset (Data Science)
Conduct data cleaning and exploratory data analysis (EDA) on a selected dataset, such as the Titanic dataset available on Kaggle. Investigate the relationships between variables and look for patterns and trends in the data. This analysis builds an understanding of the dataset and lays the foundation for informed decisions in later analytical steps.
Dataset: https://www.kaggle.com/code/vbmokin/eda-for-tabular-data-advanced-techniques
To conduct data cleaning and exploratory data analysis (EDA) on the selected dataset, we’ll follow these steps:
1. **Data Loading**: Load the dataset into a pandas DataFrame.
2. **Initial Exploration**: Explore the structure of the dataset, check for missing values, and get summary statistics.
3. **Data Cleaning**: Handle missing values, deal with outliers, and address any inconsistencies in the data.
4. **Exploratory Data Analysis**: Analyze the relationships between variables, visualize distributions, and identify patterns and trends.
5. **Inferential Statistics**: Perform statistical tests or calculations to draw insights from the data.
6. **Visualization**: Create visualizations such as histograms, scatter plots, and correlation matrices to better understand the data.
Let’s start with loading the dataset and performing initial exploration:
```python
import pandas as pd
# Load the dataset
df = pd.read_csv("eda-for-tabular-data-advanced-techniques/train.csv")
# Display the first few rows of the dataset
print(df.head())
# Check for missing values
print(df.isnull().sum())
# Get summary statistics
print(df.describe())
```
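To look at the structure more closely, it can help to collect each column's type, missing-value share, and number of distinct values in one table. The following is a minimal sketch; the helper name `summarize_structure` and the small `demo` frame are made up for illustration and are not part of the dataset or of pandas itself.

```python
import numpy as np
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing %, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Tiny made-up frame with Titanic-style column names, for illustration only
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [None, "C85", None, None],
})

print(demo.shape)                 # (rows, columns)
print(summarize_structure(demo))  # per-column overview
print(demo.duplicated().sum())    # number of fully duplicated rows
```

Calling `summarize_structure(df)` on the loaded DataFrame would show which columns are mostly empty and might be better dropped than imputed.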
Next, we’ll handle missing values and perform further exploration and analysis based on the characteristics of the dataset.
```python
# Handle missing values (for example, by imputation or removal)
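# For example, if 'Age' has missing values, you could fill them with the mean or median
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())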
# Exploratory Data Analysis
# Visualize distributions of numerical variables
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram of Age
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], bins=20, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# Boxplot of Fare, grouped by survival status
plt.figure(figsize=(8, 6))
sns.boxplot(x='Survived', y='Fare', data=df)
plt.title('Fare by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.show()
# Explore relationships between variables
# For example, correlation matrix (numeric columns only; text columns such as names cannot be correlated)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
```
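Step 3 also mentions outliers and inconsistencies, which the code above does not cover. One common option is the interquartile range (IQR) rule for flagging outliers, combined with basic normalization of text labels. The sketch below assumes that approach; the helper `iqr_bounds`, the `Fare_capped` column, and the `demo` values are invented for illustration.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values for illustration only; not taken from the real dataset
demo = pd.DataFrame({
    "Fare": [7.25, 8.05, 13.0, 26.0, 30.0, 512.33],
    "Sex": [" male", "Female", "female", "MALE", "male ", "female"],
})

# Flag values that fall outside the IQR fences
lower, upper = iqr_bounds(demo["Fare"])
is_outlier = (demo["Fare"] < lower) | (demo["Fare"] > upper)
print(demo.loc[is_outlier, "Fare"])

# One option: cap extreme values at the fences rather than dropping rows
demo["Fare_capped"] = demo["Fare"].clip(lower=lower, upper=upper)

# Normalize inconsistent category labels (stray whitespace, mixed case)
demo["Sex"] = demo["Sex"].str.strip().str.lower()
print(demo["Sex"].unique())
```

Whether to cap, drop, or keep flagged values depends on the analysis; a very high fare may be a genuine observation rather than an error, so it is worth inspecting flagged rows before changing them.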
Continue exploring the dataset based on your analytical goals and hypotheses. You can analyze categorical variables, examine relationships between variables, and perform statistical tests as needed.
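As one possible illustration of analyzing a categorical variable and running a statistical test, the sketch below compares survival rates by group and computes a Pearson chi-square statistic by hand from a contingency table. The `demo` rows are made up and only borrow Titanic-style column names.

```python
import numpy as np
import pandas as pd

# Made-up rows with Titanic-style column names, for illustration only
demo = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Survival rate per category: the mean of a 0/1 column is a proportion
survival_rate = demo.groupby("Sex")["Survived"].mean()
print(survival_rate)

# Contingency table of observed counts (rows: Sex, columns: Survived)
observed = pd.crosstab(demo["Sex"], demo["Survived"])
print(observed)

# Pearson chi-square statistic from observed vs. expected counts
obs = observed.to_numpy(dtype=float)
row_totals = obs.sum(axis=1, keepdims=True)
col_totals = obs.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / obs.sum()  # counts expected under independence
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

If SciPy is available, `scipy.stats.chi2_contingency(observed)` also returns a p-value; note that it applies Yates' continuity correction by default for 2x2 tables, so its statistic can differ from the uncorrected value computed here. With counts as small as in this toy example, the chi-square approximation is not reliable.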
This code provides a basic framework for data cleaning and exploratory analysis. Adjustments and additional steps may be necessary based on the specific characteristics and goals of your analysis.