import pandas as pd

import matplotlib.pyplot as plt

y = pd.read_csv("Pakistan Largest Ecommerce Dataset.csv",low_memory=False)

y.head()

y.info()

<class 'pandas.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   item_id                584524 non-null  float64
 1   status                 584509 non-null  str    
 2   created_at             584524 non-null  str    
 3   sku                    584504 non-null  str    
 4   price                  584524 non-null  float64
 5   qty_ordered            584524 non-null  float64
 6   grand_total            584524 non-null  float64
 7   increment_id           584524 non-null  str    
 8   category_name_1        584360 non-null  str    
 9   sales_commission_code  447346 non-null  str    
 10  discount_amount        584524 non-null  float64
 11  payment_method         584524 non-null  str    
 12  Working Date           584524 non-null  str    
 13  BI Status              584524 non-null  str    
 14   MV                    584524 non-null  str    
 15  Year                   584524 non-null  float64
 16  Month                  584524 non-null  float64
 17  Customer Since         584513 non-null  str    
 18  M-Y                    584524 non-null  str    
 19  FY                     584524 non-null  str    
 20  Customer ID            584513 non-null  float64
 21  Unnamed: 21            0 non-null       float64
 22  Unnamed: 22            0 non-null       float64
 23  Unnamed: 23            0 non-null       float64
 24  Unnamed: 24            0 non-null       float64
 25  Unnamed: 25            0 non-null       float64
dtypes: float64(13), str(13)
memory usage: 208.0 MB

y.describe()

# Checking for patterns in the data
print("\nInitial Observations:")
print(f"- Dataset has {len(y):,} rows and {len(y.columns)} columns")
print(f"- Number of unique categories: {y['category_name_1'].nunique()}")
print(f"- Number of unique statuses: {y['status'].nunique()}")
print(f"- Average order value: {y['grand_total'].mean():.2f} PKR")
print(f"- Price range: {y['price'].min():.2f} to {y['price'].max():.2f} PKR")

Initial Observations:
- Dataset has 1,048,575 rows and 27 columns
- Number of unique categories: 17
- Number of unique statuses: 17
- Average order value: 5622.94 PKR
- Price range: 0.00 to 1012625.90 PKR

y.isnull().sum()

item_id                   464051
status                    464066
created_at                464051
sku                       464071
price                     464051
qty_ordered               464051
grand_total               464051
increment_id              464051
category_name_1           464215
sales_commission_code     601229
discount_amount           464051
payment_method            464051
Working Date              464051
BI Status                 464051
 MV                       464051
Year                      464051
Month                     464051
Customer Since            464062
M-Y                       464051
FY                        464051
Customer ID               464062
Unnamed: 21              1048575
Unnamed: 22              1048575
Unnamed: 23              1048575
Unnamed: 24              1048575
Unnamed: 25              1048575
dtype: int64

 
y['created_at'] = y['created_at'].fillna('Unknown Date')
print("Filled missing created_at with 'Unknown Date'")

Filled missing created_at with 'Unknown Date'

y.isnull().sum()

item_id                   464051
status                    464066
created_at                     0
sku                       464071
price                     464051
qty_ordered               464051
grand_total               464051
increment_id              464051
category_name_1           464215
sales_commission_code     601229
discount_amount           464051
payment_method            464051
Working Date              464051
BI Status                 464051
 MV                       464051
Year                      464051
Month                     464051
Customer Since            464062
M-Y                       464051
FY                        464051
Customer ID               464062
Unnamed: 21              1048575
Unnamed: 22              1048575
Unnamed: 23              1048575
Unnamed: 24              1048575
Unnamed: 25              1048575
dtype: int64

y['item_id'] = y['item_id'].fillna(0)
print("Filled item_id with 0")

Filled item_id with 0

y['status'] = y['status'].fillna('Unknown')
print("Filled missing status with 'Unknown'")

Filled missing status with 'Unknown'

y['sku'] = y['sku'].fillna('Unknown')
print("Filled missing sku with 'Unknown'")

Filled missing sku with 'Unknown'

median_price = y['price'].median()
y['price'] = y['price'].fillna(median_price)
print(f"Filled missing price with median: {median_price}")

Filled missing price with median: 899.0

y['qty_ordered'] = y['qty_ordered'].fillna(1)
print("Filled missing qty_ordered with 1")

Filled missing qty_ordered with 1

y['grand_total'] = y['grand_total'].fillna(y['grand_total'].median())
print("Filled grand_total with median")

Filled grand_total with median

y['increment_id'] = y['increment_id'].fillna('Unknown')
print("Filled missing increment_id with 'Unknown'")

Filled missing increment_id with 'Unknown'

y['category_name_1'] = y['category_name_1'].fillna('Uncategorized')
print("Filled category_name_1")

Filled category_name_1

y['sales_commission_code'] = y['sales_commission_code'].fillna('No Commission')
print("Filled missing sales_commission_code with 'No Commission'")

Filled missing sales_commission_code with 'No Commission'

y['discount_amount'] = y['discount_amount'].fillna(0)
print("Filled missing discount_amount with 0")

Filled missing discount_amount with 0

y['payment_method'] = y['payment_method'].fillna('Unknown')
print("Filled missing payment_method with 'Unknown'")

Filled missing payment_method with 'Unknown'

y['Working Date'] = y['Working Date'].fillna('Unknown Date')
print("Filled missing Working Date with 'Unknown Date'")

Filled missing Working Date with 'Unknown Date'

y['BI Status'] = y['BI Status'].fillna('Not Available')
print("Filled missing BI Status with 'Not Available'")

Filled missing BI Status with 'Not Available'

y[' MV '] = y[' MV '].fillna('Unknown')
print("Filled missing MV with 'Unknown'")

Filled missing MV with 'Unknown'

y['Year'] = y['Year'].fillna(y['Year'].mode()[0])
print("Filled Year")

Filled Year

y['Month'] = y['Month'].fillna(y['Month'].mode()[0])
print("Filled Month")

Filled Month

y['Customer Since'] = y['Customer Since'].fillna('Unknown Date')
print("Filled missing Customer Since with 'Unknown Date'")

Filled missing Customer Since with 'Unknown Date'

y['M-Y'] = y['M-Y'].fillna('Unknown')
print("Filled missing M-Y with 'Unknown'")

Filled missing M-Y with 'Unknown'

y['FY'] = y['FY'].fillna('Unknown')
print("Filled missing FY with 'Unknown'")

Filled missing FY with 'Unknown'

y['Customer ID'] = y['Customer ID'].fillna(-1)
print("Filled missing Customer ID with -1")

Filled missing Customer ID with -1

y.isnull().sum()

item_id                        0
status                         0
created_at                     0
sku                            0
price                          0
qty_ordered                    0
grand_total                    0
increment_id                   0
category_name_1                0
sales_commission_code          0
discount_amount                0
payment_method                 0
Working Date                   0
BI Status                      0
 MV                            0
Year                           0
Month                          0
Customer Since                 0
M-Y                            0
FY                             0
Customer ID                    0
Unnamed: 21              1048575
Unnamed: 22              1048575
Unnamed: 23              1048575
Unnamed: 24              1048575
Unnamed: 25              1048575
dtype: int64

# Analyzing order status of dataset
status_counts = y.groupby('status').size()
print("Order Status Distribution:")
print(status_counts)

Order Status Distribution:
status
Unknown           464066
\N                     4
canceled          201249
closed               494
cod                 2859
complete          233685
exchange               4
fraud                 10
holded                31
order_refunded     59529
paid                1159
payment_review        57
pending               48
pending_paypal         7
processing            33
received           77290
refund              8050
dtype: int64

# As percentages to get- 
status_percent = (status_counts / len(y) * 100).round(2)
print("\nPercentages:")
print(status_percent)

Percentages:
status
Unknown           44.26
\N                 0.00
canceled          19.19
closed             0.05
cod                0.27
complete          22.29
exchange           0.00
fraud              0.00
holded             0.00
order_refunded     5.68
paid               0.11
payment_review     0.01
pending            0.00
pending_paypal     0.00
processing         0.00
received           7.37
refund             0.77
dtype: float64

# to checck the categories make most money
category_revenue = y.groupby('category_name_1')['grand_total'].sum().sort_values(ascending=False)
print("Top Categories by Revenue:")
print(category_revenue.head(10))

Top Categories by Revenue:
category_name_1
Mobiles & Tablets    2.440791e+09
Uncategorized        9.100086e+08
Appliances           6.568497e+08
Entertainment        5.390480e+08
Women's Fashion      2.825779e+08
Computing            2.025457e+08
Men's Fashion        1.941390e+08
Others               1.924656e+08
Superstore           1.121309e+08
Beauty & Grooming    9.718574e+07
Name: grand_total, dtype: float64

# order count per categorywise 
category_orders = y.groupby('category_name_1').size().sort_values(ascending=False)
print("\nOrders per Category:")
print(category_orders.head(10))

Orders per Category:
category_name_1
Uncategorized        464215
Mobiles & Tablets    115710
Men's Fashion         92221
Women's Fashion       59721
Appliances            52413
Superstore            43613
Beauty & Grooming     41496
Soghaat               34011
Others                29218
Home & Living         26504
dtype: int64

# most popular/used payment methods by cx 
payment_counts = y['payment_method'].value_counts()
print("Payment Methods:")
print(payment_counts.head(10))

Payment Methods:
payment_method
Unknown            464051
cod                271960
Payaxis             97641
Easypay             82900
jazzwallet          35145
easypay_voucher     31176
bankalfalah         23065
jazzvoucher         15633
Easypay_MA          14028
customercredit       7555
Name: count, dtype: int64

# Filtreing  only completed orders to see successful payments
completed = y[y['status'] == 'complete']
completed_payments = completed['payment_method'].value_counts()
print("\nPayment Methods for Completed Orders:")
print(completed_payments.head(10))

Payment Methods for Completed Orders:
payment_method
cod                148039
Payaxis             22809
Easypay             19214
easypay_voucher     16066
jazzwallet          13505
jazzvoucher          4619
customercredit       4151
Easypay_MA           3116
cashatdoorstep        674
bankalfalah           586
Name: count, dtype: int64

# Checking if discounts are used 
y['has_discount'] = y['discount_amount'] > 0
discount_summary = y.groupby('has_discount')['grand_total'].agg(['count', 'mean'])
print("Orders with vs without Discount:")
print(discount_summary)

Orders with vs without Discount:
               count         mean
has_discount                     
False         840360  4694.096795
True          208215  9371.782473

# Sales by month in dataset 
monthly_sales = y.groupby('Month')['grand_total'].sum().sort_values(ascending=False)
print("Revenue by Month:")
print(monthly_sales)

Revenue by Month:
Month
11.0    1.977609e+09
5.0     7.072266e+08
3.0     5.603449e+08
6.0     4.725649e+08
2.0     4.082264e+08
7.0     3.894242e+08
8.0     3.473267e+08
4.0     2.730674e+08
10.0    2.175180e+08
1.0     2.076232e+08
12.0    1.728805e+08
9.0     1.622652e+08
Name: grand_total, dtype: float64

# orders per month
monthly_orders = y.groupby('Month').size().sort_values(ascending=False)
print("\nOrders per Month:")
print(monthly_orders)

Orders per Month:
Month
11.0    619507
5.0      62603
3.0      61489
8.0      48514
7.0      39151
2.0      38777
6.0      34530
4.0      34091
10.0     30623
12.0     29199
1.0      26067
9.0      24024
dtype: int64

# Checking relationships between numbers in the dataset 
numeric_data = y[['price', 'qty_ordered', 'grand_total', 'discount_amount']]
correlation = numeric_data.corr()
print("Correlation Matrix:")
print(correlation)

Correlation Matrix:
                    price  qty_ordered  grand_total  discount_amount
price            1.000000    -0.005497     0.285861         0.483108
qty_ordered     -0.005497     1.000000     0.754170         0.003178
grand_total      0.285861     0.754170     1.000000         0.117417
discount_amount  0.483108     0.003178     0.117417         1.000000

# creating clean dataset for visualizations
y_viz = y[(y['status'] != 'Unknown') & 
          (y['payment_method'] != 'Unknown') & 
          (y['category_name_1'] != 'Uncategorized')]

print(f"Original dataset: {len(y):,} rows")
print(f"Clean visualization dataset: {len(y_viz):,} rows")
print(f"Removed {len(y) - len(y_viz):,} rows with Unknown/Uncategorized values")

Original dataset: 1,048,575 rows
Clean visualization dataset: 584,345 rows
Removed 464,230 rows with Unknown/Uncategorized values

# removing Unknown status first
clean_status = y[y['status'] != 'Unknown']
status_counts = clean_status['status'].value_counts()

plt.figure(figsize=(10, 6))
plt.pie(status_counts, labels=status_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Order Status Distribution', fontsize=16)
plt.show()

# removing Uncategorized category
category_clean = y[y['category_name_1'] != 'Uncategorized']

# top 10 categories by revenue
category_revenue = category_clean.groupby('category_name_1')['grand_total'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
plt.bar(range(len(category_revenue)), category_revenue.values, color='steelblue')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Total Revenue', fontsize=12)
plt.title('Top 10 Categories by Revenue (Excluding Uncategorized)', fontsize=16)
plt.xticks(range(len(category_revenue)), category_revenue.index, rotation=45, ha='right')
plt.tight_layout()
plt.show()

# removing Unknown payment methods
payment_clean = y[y['payment_method'] != 'Unknown']

# top payment methods
payment_counts = payment_clean['payment_method'].value_counts().head(8)

plt.figure(figsize=(10, 6))
plt.barh(range(len(payment_counts)), payment_counts.values, color='coral')
plt.ylabel('Payment Method', fontsize=12)
plt.xlabel('Number of Orders', fontsize=12)
plt.title('Top Payment Methods (Excluding Unknown)', fontsize=16)
plt.yticks(range(len(payment_counts)), payment_counts.index)
plt.tight_layout()
plt.show()

# line chart for monthly trends
monthly_revenue = y.groupby('Month')['grand_total'].sum()

plt.figure(figsize=(12, 6))
plt.plot(monthly_revenue.index, monthly_revenue.values, marker='o', linewidth=2, color='green')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Total Revenue', fontsize=12)
plt.title('Monthly Revenue Trend', fontsize=16)
plt.xticks(range(1, 13))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# comparing discount impact
discount_counts = y['has_discount'].value_counts()

plt.figure(figsize=(8, 6))
plt.bar(['No Discount', 'With Discount'], discount_counts.values, color=['lightblue', 'orange'])
plt.xlabel('Discount Status', fontsize=12)
plt.ylabel('Number of Orders', fontsize=12)
plt.title('Orders: With vs Without Discount', fontsize=16)
plt.tight_layout()
plt.show()

# stacked bar for top categories
top_categories = y['category_name_1'].value_counts().head(8).index
filtered_data = y[y['category_name_1'].isin(top_categories)]

category_status = filtered_data.groupby(['category_name_1', 'status']).size().unstack(fill_value=0)

plt.figure(figsize=(12, 6))
category_status[['complete', 'canceled']].plot(kind='bar', color=['green', 'red'], width=0.8)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Number of Orders', fontsize=12)
plt.title('Completed vs Canceled Orders by Category', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.legend(['Completed', 'Canceled'])
plt.tight_layout()
plt.show()

<Figure size 1200x600 with 0 Axes>

# bar chart for yearly revenue
yearly_revenue = y.groupby('Year')['grand_total'].sum()

plt.figure(figsize=(10, 6))
plt.bar(yearly_revenue.index.astype(str), yearly_revenue.values, color='purple')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Total Revenue', fontsize=12)
plt.title('Total Revenue by Year', fontsize=16)
plt.tight_layout()
plt.show()

# showing average order value
category_avg = y.groupby('category_name_1')['grand_total'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
plt.barh(range(len(category_avg)), category_avg.values, color='teal')
plt.ylabel('Category', fontsize=12)
plt.xlabel('Average Order Value', fontsize=12)
plt.title('Average Order Value by Category', fontsize=16)
plt.yticks(range(len(category_avg)), category_avg.index)
plt.tight_layout()
plt.show()

	item_id	status	created_at	sku	price	qty_ordered	grand_total	increment_id	category_name_1	sales_commission_code	...	Month	Customer Since	M-Y	FY	Customer ID	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25
0	211131.0	complete	7/1/2016	kreations_YI 06-L	1950.0	1.0	1950.0	100147443	Women's Fashion	\N	...	7.0	2016-7	7-2016	FY17	1.0	NaN	NaN	NaN	NaN	NaN
1	211133.0	canceled	7/1/2016	kcc_Buy 2 Frey Air Freshener & Get 1 Kasual Bo...	240.0	1.0	240.0	100147444	Beauty & Grooming	\N	...	7.0	2016-7	7-2016	FY17	2.0	NaN	NaN	NaN	NaN	NaN
2	211134.0	canceled	7/1/2016	Ego_UP0017-999-MR0	2450.0	1.0	2450.0	100147445	Women's Fashion	\N	...	7.0	2016-7	7-2016	FY17	3.0	NaN	NaN	NaN	NaN	NaN
3	211135.0	complete	7/1/2016	kcc_krone deal	360.0	1.0	60.0	100147446	Beauty & Grooming	R-FSD-52352	...	7.0	2016-7	7-2016	FY17	4.0	NaN	NaN	NaN	NaN	NaN
4	211136.0	order_refunded	7/1/2016	BK7010400AG	555.0	2.0	1110.0	100147447	Soghaat	\N	...	7.0	2016-7	7-2016	FY17	5.0	NaN	NaN	NaN	NaN	NaN

	item_id	price	qty_ordered	grand_total	discount_amount	Year	Month	Customer ID	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25
count	584524.000000	5.845240e+05	584524.000000	5.845240e+05	584524.000000	584524.000000	584524.000000	584513.000000	0.0	0.0	0.0	0.0	0.0
mean	565667.074218	6.348748e+03	1.296388	8.530619e+03	499.492775	2017.044115	7.167654	45790.511965	NaN	NaN	NaN	NaN	NaN
std	200121.173648	1.494927e+04	3.996061	6.132081e+04	1506.943046	0.707355	3.486305	34414.962389	NaN	NaN	NaN	NaN	NaN
min	211131.000000	0.000000e+00	1.000000	-1.594000e+03	-599.500000	2016.000000	1.000000	1.000000	NaN	NaN	NaN	NaN	NaN
25%	395000.750000	3.600000e+02	1.000000	9.450000e+02	0.000000	2017.000000	4.000000	13516.000000	NaN	NaN	NaN	NaN	NaN
50%	568424.500000	8.990000e+02	1.000000	1.960400e+03	0.000000	2017.000000	7.000000	42856.000000	NaN	NaN	NaN	NaN	NaN
75%	739106.250000	4.070000e+03	1.000000	6.999000e+03	160.500000	2018.000000	11.000000	73536.000000	NaN	NaN	NaN	NaN	NaN
max	905208.000000	1.012626e+06	1000.000000	1.788800e+07	90300.000000	2018.000000	12.000000	115326.000000	NaN	NaN	NaN	NaN	NaN

Pakistan E-Commerce Sales Analysis¶

Python for Data Science - Final Project¶

Student: Kundan JAISWAL Date: 30 Jan 2026¶

Research Question¶

Dataset Description¶

Analytical Approach¶

data exploration¶

Initial Observations¶

3. Data Cleaning¶

handling all remaining columns with missing values¶

Data Cleaning Summary¶

4 . Data Analysis¶

Analysis 1: Order Status Distribution (groupby)¶

Insight: Most orders are completed, but there's a significant cancellation rate that needs attention.¶

Analysis 2: Revenue by Category (groupby)¶

Insight: Mobiles & Tablets and Women's Fashion are the top revenue generators.¶

Analysis 3: Payment Method Analysis (filter)¶

Insight: COD is most popular, showing customers prefer to pay on delivery.¶

Analysis 4: Discount Impact (filter + groupby)¶

Insight: Orders with discounts have different average values, showing promotional impact.¶

Analysis 5: Monthly Sales Trends (groupby)¶

Insight: Sales vary by month, showing seasonal patterns in customer behavior.¶

Analysis 6: Correlation Analysis (statistical)¶

Insight: Strong correlation between price and grand_total, which makes sense for our data.¶

5. Visualizations¶

Visualization 1: Order Status Distribution¶

Visualization 2: Top 10 Categories by Revenue¶

Interpretation: Mobiles & Tablets dominates revenue, followed by Entertainment and Appliances.¶

Visualization 3: Payment Method Distribution¶

Interpretation: COD (Cash on Delivery) is by far the most popular payment method, showing customers prefer to pay when receiving products.¶

Visualization 4: Monthly Sales Trend¶

Interpretation: Clear seasonal pattern with peaks in certain months, likely related to holidays or festivals.¶

Visualization 5: Orders With vs Without Discount¶

Interpretation: Most orders don't use discounts, suggesting selective promotional strategy.¶

Visualization 6: Completed vs Canceled Orders by Category¶

Interpretation: All categories have cancellations, but some categories have higher cancellation rates than others.¶

Visualization 7: Revenue Distribution by Year¶

Interpretation: Revenue growth or decline over the years shows business performance trends.¶

Visualization 8: Average Order Value by Category¶

Interpretation: Mobiles & Tablets has the highest average order value, while other categories vary significantly.¶

Conclusions¶

What factors contribute to order cancellations in e-commerce, and how can we optimize revenue by understanding customer behavior and payment preferences?¶

Key Findings¶

Quantified Insights¶

Actionable Recommendations¶

Reflection on Methodology¶

Limitations¶

Summary¶

AI Usage Log¶

Student: Kundan JAISWAL
Date: 30 Jan 2026¶