Data Cleaning and Exploratory Data Analysis (EDA) on a Selected Dataset (Data Science)

Feb 15, 2024

This post walks through data cleaning and exploratory data analysis (EDA) on a selected dataset, using the Titanic dataset available on Kaggle as the example. The aim is to investigate relationships between variables and surface patterns and trends in the data, which lays the foundation for informed decisions in later analysis steps.

Dataset source (a Kaggle notebook): https://www.kaggle.com/code/vbmokin/eda-for-tabular-data-advanced-techniques

To conduct data cleaning and exploratory data analysis (EDA) on the selected dataset, we’ll follow these steps:

1. **Data Loading**: Load the dataset into a pandas DataFrame.
2. **Initial Exploration**: Explore the structure of the dataset, check for missing values, and get summary statistics.
3. **Data Cleaning**: Handle missing values, deal with outliers, and address any inconsistencies in the data.
4. **Exploratory Data Analysis**: Analyze the relationships between variables, visualize distributions, and identify patterns and trends.
5. **Inferential Statistics**: Perform statistical tests or calculations to draw insights from the data.
6. **Visualization**: Create visualizations such as histograms, scatter plots, and correlation matrices to better understand the data.

Let’s start with loading the dataset and performing initial exploration:

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("eda-for-tabular-data-advanced-techniques/train.csv")

# Display the first few rows of the dataset
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Get summary statistics
print(df.describe())
```
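By default `describe()` summarizes only the numeric columns, so a compact per-column overview can complement it. The sketch below runs on a small made-up DataFrame with Titanic-style column names (an assumption for illustration; substitute the `df` loaded above). `summarize_columns` is a hypothetical helper written for this example, not a pandas function.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
    "Cabin": [None, "C85", None, "C123", None],
})

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, missing percentage, and unique count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(1),
        "unique": frame.nunique(),
    })

overview = summarize_columns(toy)
print(overview)
```

Columns with a high `missing_pct` are usually better dropped or flagged than imputed, while columns with only a few gaps are reasonable candidates for the median fill shown in the next step.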

Next, we’ll handle missing values and perform further exploration and analysis based on the characteristics of the dataset.

```python
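# Handle missing values (for example, by imputation or removal)
# For example, if 'Age' has missing values, you could fill them with the mean or median
# (median shown here). Assigning the result back avoids the chained in-place call,
# which recent pandas versions warn about.
df['Age'] = df['Age'].fillna(df['Age'].median())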

# Exploratory Data Analysis
# Visualize distributions of numerical variables
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of Age
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], bins=20, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Boxplot of Fare, grouped by survival status
plt.figure(figsize=(8, 6))
sns.boxplot(x='Survived', y='Fare', data=df)
plt.title('Fare by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.show()

# Explore relationships between variables
# For example, correlation matrix (numeric columns only; text columns
# such as names would raise an error in pandas 2.0 and later)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
```
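The step list above includes outlier handling under data cleaning, and the fare boxplot is where outliers typically show up. One common option, used here as an assumption rather than something the post prescribes, is the 1.5 × IQR rule, which is also seaborn's default for boxplot whiskers. The sketch uses invented fare values, and `iqr_bounds` is a hypothetical helper.

```python
import pandas as pd

# Toy fares (invented values); in practice this would be df["Fare"]
fares = pd.Series(
    [7.25, 7.93, 8.05, 13.0, 26.0, 30.0, 53.1, 71.28, 512.33], name="Fare"
)

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(fares)

# Values outside the fences are flagged as outliers
outliers = fares[(fares < lower) | (fares > upper)]

# One way to handle them: cap (winsorize) at the fences instead of dropping rows
capped = fares.clip(lower=lower, upper=upper)

print(outliers)
print(capped.max())
```

Capping keeps every row, which matters on a small dataset. Whether to cap, drop, or keep extreme values depends on whether they are data errors or genuine observations.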

Continue exploring the dataset based on your analytical goals and hypotheses. You can analyze categorical variables, relationships between variables, and perform statistical tests as needed.
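As a minimal sketch of the categorical analysis and statistical testing mentioned here, the example below compares survival rates across a categorical variable and runs a chi-square test of independence with SciPy's `chi2_contingency`. The toy data is invented and the choice of test is an assumption; with the real data you would pass `df['Sex']` and `df['Survived']` instead.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (invented): 10 women and 10 men with different survival outcomes
toy = pd.DataFrame({
    "Sex": ["female"] * 10 + ["male"] * 10,
    "Survived": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
})

# Survival rate per category: the mean of a 0/1 column is the share of 1s
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)

# Contingency table of counts: rows are Sex, columns are Survived
contingency = pd.crosstab(toy["Sex"], toy["Survived"])
print(contingency)

# Chi-square test of independence (SciPy applies Yates' correction to 2x2 tables by default)
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value suggests the two variables are not independent, but it says nothing about the size or direction of the effect, so read it alongside the survival rates themselves.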

This code provides a basic framework for data cleaning and exploratory analysis. Adjustments and additional steps may be necessary based on the specific characteristics and goals of your analysis.

Written by ByteUprise