The rapid growth of online commerce has generated massive volumes of transactional data. Analyzing this information will allows businesses to understand customer behaviour, evaluate product performance and optimize revenue strategies to help the system.
This project uses the Pakistan Largest Ecommerce Dataset available on Kaggle. The dataset contains several hundred thousand transactions recorded between 2016 and 2018. It includes information about orders, products, categories, prices, quantities, payment methods and discounts.
Because the dataset captures real commercial activity across multiple years, it provides an excellent opportunity to explore temporal trends, purchasing patterns and revenue distribution across categories.
1.1 Research Questions
The analysis focuses on the following questions:
How do sales and order volumes evolve over time?
Which product categories generate the highest revenue?
What payment methods are most frequently used by customers?
Do discounts appear to influence purchasing activity?
1.2 Hypotheses
To guide the exploration, the following hypotheses are considered:
Sales volume increases over time as e-commerce adoption grows.
A small number of categories contribute to a large share of total revenue.
Cash on delivery is expected to be the dominant payment method.
Higher discounts may be associated with higher order counts.
1.3 Analytical Roadmap
The project will begin with
- Data import and exploration,
- Cleaning and preparation of variables.
- Transformations and aggregations will be performed using modern dplyr workflows.
- The analysis will then present static.
- Interactive visualizations, summary tables and a dashboard of key performance indicators.
- Conclusions will be drawn based on statistical evidence obtained from the data.
2 Data Import and Initial Exploration
# reading the datasetecom <-read_csv("C:/Users/Kundan Jaiswal/OneDrive/Desktop/R Final Project/Pakistan Largest Ecommerce Dataset.csv")
# Step 3.2 – Handle Missing Values# Start with the raw datasetecom_clean <- ecom |># Step 1: Replace missing numeric values with 0# This ensures calculations (like total revenue) work correctlymutate(price =replace_na(price, 0),qty_ordered =replace_na(qty_ordered, 0),grand_total =replace_na(grand_total, 0),discount_amount =replace_na(discount_amount, 0) ) |># Step 2: Replace missing categorical values with "Unknown"# Ensures grouping and analysis works consistentlymutate(status =replace_na(status, "Unknown"),category_name_1 =replace_na(category_name_1, "Unknown"),sales_commission_code =replace_na(sales_commission_code, "Unknown"),payment_method =replace_na(payment_method, "Unknown") ) |># Step 3: Drop completely empty columns (...22 to ...26)select(-c("...22","...23","...24","...25","...26")) |># Step 4: Filter out rows with missing essential identifiers# item_id, sku, and created_at are required for analysisfilter(!is.na(item_id), !is.na(sku), !is.na(created_at))
# See the first 10 rows of key columns to check missing values were handledhead(ecom_clean |>select( item_id, sku, created_at, price, qty_ordered, grand_total, discount_amount, status, category_name_1, payment_method, sales_commission_code), 10)
# Step 3.3 – Parse Date and Create Derived Column# Step 1: Parse 'created_at' column from character to Date# Format in the dataset is month/day/year, so we use lubridate::mdy()ecom_clean <- ecom_clean |>mutate(created_at =mdy(created_at) # convert to Date )# Step 2: Create a derived column 'total_revenue'# Formula: total_revenue = qty_ordered * price - discount_amountecom_clean <- ecom_clean |>mutate(total_revenue = qty_ordered * price - discount_amount )# Step 3: Verify first 10 rows to check changeshead(ecom_clean |>select(created_at, price, qty_ordered, discount_amount, total_revenue), 10)
library(gt)# Step 3.4 – Create Data Dictionary# Create a tibble with all relevant informationdata_dict <-tibble(Column =c("item_id", "status", "created_at", "sku", "category_name_1","qty_ordered", "price", "grand_total", "sales_commission_code","payment_method", "discount_amount", "total_revenue" ),Description =c("Unique identifier for each order","Order status","Date when order was placed","Unique product identifier","Product category name","Quantity of product ordered","Unit price of product","Total amount of order","Sales commission code","Payment method used by customer","Discount applied on order","Total revenue = qty_ordered * price - discount_amount" ),Type =c("Integer", "Character", "Date", "Character", "Character","Integer", "Numeric", "Numeric", "Character","Character", "Numeric", "Numeric" ),Missing_Values =c(sum(is.na(ecom_clean$item_id)),sum(is.na(ecom_clean$status)),sum(is.na(ecom_clean$created_at)),sum(is.na(ecom_clean$sku)),sum(is.na(ecom_clean$category_name_1)),sum(is.na(ecom_clean$qty_ordered)),sum(is.na(ecom_clean$price)),sum(is.na(ecom_clean$grand_total)),sum(is.na(ecom_clean$sales_commission_code)),sum(is.na(ecom_clean$payment_method)),sum(is.na(ecom_clean$discount_amount)),sum(is.na(ecom_clean$total_revenue)) ),Cleaning_Applied =c("Filtered out rows with missing item_id","Replaced NA with 'Unknown'","Trimmed spaces + parsed with lubridate::mdy()","Filtered out rows with missing sku","Replaced NA with 'Unknown'","Replaced NA with 0","Replaced NA with 0","Replaced NA with 0","Replaced NA with 'Unknown'","Replaced NA with 'Unknown'","Replaced NA with 0","Calculated from qty_ordered * price - discount_amount" ))# Render the Data Dictionary nicely using gtdata_dict |>gt()
Column
Description
Type
Missing_Values
Cleaning_Applied
item_id
Unique identifier for each order
Integer
0
Filtered out rows with missing item_id
status
Order status
Character
0
Replaced NA with 'Unknown'
created_at
Date when order was placed
Date
0
Trimmed spaces + parsed with lubridate::mdy()
sku
Unique product identifier
Character
0
Filtered out rows with missing sku
category_name_1
Product category name
Character
0
Replaced NA with 'Unknown'
qty_ordered
Quantity of product ordered
Integer
0
Replaced NA with 0
price
Unit price of product
Numeric
0
Replaced NA with 0
grand_total
Total amount of order
Numeric
0
Replaced NA with 0
sales_commission_code
Sales commission code
Character
0
Replaced NA with 'Unknown'
payment_method
Payment method used by customer
Character
0
Replaced NA with 'Unknown'
discount_amount
Discount applied on order
Numeric
0
Replaced NA with 0
total_revenue
Total revenue = qty_ordered * price - discount_amount
Numeric
0
Calculated from qty_ordered * price - discount_amount
Data Quality Issues
The original dataset contained 464,066 rows with missing critical fields (44% of total data). These were removed to ensure analysis accuracy. Always verify data completeness before drawing conclusions.
3.1 Section 4 – Data Analysis
4 Step 4.1 – Filter Operations
library(dplyr)library(stringr)# 1. Numeric comparison: orders with total_revenue greater than 1000high_revenue_orders <- ecom_clean |>filter(total_revenue >1000)# 2. %in% multiple values: select orders with specific statusesselected_status_orders <- ecom_clean |>filter(status %in%c("complete", "processing", "pending"))# 3. String matching: orders where category name starts with "Electronics"electronics_orders <- ecom_clean |>filter(str_starts(category_name_1, "Electronics"))# verifing results by seeing first 5 rowshead(high_revenue_orders, 5)
#Check the correlation between price, qty_ordered, discount_amount, and total_revenue.
12 2. Top 10 Orders by Revenue (slice_max)
# Top 10 orders by total revenuetop_orders <- ecom_analysis |>slice_max(total_revenue, n =10) |>select(item_id, total_revenue, category_name_1, qty_ordered, price)top_orders
14 4. Cross-tab: Count of orders by revenue category & high quantity flag
# Count of orders by revenue category and high quantityorder_counts <- ecom_analysis |>count(revenue_category, high_qty_flag)order_counts
# A tibble: 6 × 3
revenue_category high_qty_flag n
<chr> <chr> <int>
1 High No 135207
2 High Yes 5154
3 Low No 305948
4 Low Yes 2323
5 Medium No 130242
6 Medium Yes 5630
15 Section 5: Data Visualization
Creating clear and informative visualizations to communicate insights from the e-commerce data.
15.1 Visualization 1: Scatter Plot with Trend Line
# Create scatter plot: quantity vs revenue with color by revenue categoryscatter_data <- ecom_analysis |>filter(qty_ordered <=20, total_revenue <=10000) |># Remove extreme outliers for clarityfilter(total_revenue >0) # Remove zero/negative valuesggplot(scatter_data, aes(x = qty_ordered, y = total_revenue, color = revenue_category)) +geom_point(alpha =0.5, size =2) +geom_smooth(method ="lm", se =TRUE, color ="black", linewidth =1) +scale_y_continuous(labels = comma, breaks =seq(0, 10000, 2000)) +scale_x_continuous(breaks =seq(0, 20, 2)) +scale_color_manual(values =c("Low"="#3498db", "Medium"="#f39c12", "High"="#e74c3c")) +labs(title ="Order Quantity vs Total Revenue",subtitle ="Positive relationship between quantity ordered and revenue",x ="Quantity Ordered",y ="Total Revenue (PKR)",color ="Revenue Category" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =12),legend.position ="top",legend.title =element_text(face ="bold") )
Figure 1: Relationship between Order Quantity and Revenue
15.2 Visualization 2: Bar Chart with Value Labels
# Calculate top 10 categories by revenuetop_categories <- ecom_analysis |>summarise(total_revenue =sum(total_revenue),total_orders =n(),.by = category_name_1 ) |>arrange(desc(total_revenue)) |>head(10)ggplot(top_categories, aes(x =reorder(category_name_1, total_revenue), y = total_revenue)) +geom_col(fill ="#2ecc71", alpha =0.8) +geom_text(aes(label =comma(round(total_revenue, 0))), hjust =-0.1, size =3.5, fontface ="bold") +coord_flip() +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.15))) +labs(title ="Top 10 Product Categories by Revenue",subtitle ="Total revenue in PKR across all orders",x =NULL,y ="Total Revenue (PKR)" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title.x =element_text(size =12, face ="bold"),axis.text.y =element_text(size =11),axis.text.x =element_text(size =10),panel.grid.major.y =element_blank() )
Figure 2: Top 10 Product Categories by Total Revenue
15.3 Visualization 3: Line Chart - Monthly Revenue Trends
# Create Month_Year column if it doesn't exist, then calculate monthly revenuemonthly_revenue <- ecom_analysis |>mutate(Month_Year =format(created_at, "%Y-%m")) |>summarise(total_revenue =sum(total_revenue),total_orders =n(),.by = Month_Year ) |>arrange(Month_Year)ggplot(monthly_revenue, aes(x = Month_Year, y = total_revenue, group =1)) +geom_line(color ="#3498db", linewidth =1.5) +geom_point(color ="#e74c3c", size =3) +scale_y_continuous(labels = comma) +labs(title ="Monthly Revenue Trends",subtitle ="E-commerce revenue performance over time",x ="Month-Year",y ="Total Revenue (PKR)" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =12, face ="bold"),axis.text.x =element_text(angle =45, hjust =1, size =9),axis.text.y =element_text(size =10),panel.grid.minor =element_blank() )
Figure 3: Monthly Revenue Trends Over Time
15.4 Visualization 4: Histogram - Revenue Distribution
# Filter for better visualization (remove extreme outliers)revenue_dist <- ecom_analysis |>filter(total_revenue >0, total_revenue <=5000)ggplot(revenue_dist, aes(x = total_revenue)) +geom_histogram(bins =50, fill ="#9b59b6", color ="white", alpha =0.8) +geom_vline(aes(xintercept =median(total_revenue)), color ="red", linetype ="dashed", linewidth =1) +annotate("text", x =median(revenue_dist$total_revenue) +400, y =Inf, label =paste("Median:", round(median(revenue_dist$total_revenue), 0), "PKR"),vjust =2, color ="red", fontface ="bold", size =4) +scale_x_continuous(labels = comma, breaks =seq(0, 5000, 1000)) +scale_y_continuous(labels = comma) +labs(title ="Distribution of Order Revenue",subtitle ="Most orders fall in the lower revenue range",x ="Total Revenue (PKR)",y ="Number of Orders" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =12, face ="bold"),axis.text =element_text(size =10) )
Figure 4: Distribution of Total Revenue per Order
15.5 Visualization 5: Faceted Plot - Revenue by Category and Status
# Get top 6 categories and their status breakdowntop_6_cats <- ecom_analysis |>summarise(total_rev =sum(total_revenue), .by = category_name_1) |>arrange(desc(total_rev)) |>head(6) |>pull(category_name_1)facet_data <- ecom_analysis |>filter(category_name_1 %in% top_6_cats) |>summarise(total_revenue =sum(total_revenue),order_count =n(),.by =c(category_name_1, status) )ggplot(facet_data, aes(x =reorder(status, -total_revenue), y = total_revenue, fill = status)) +geom_col(alpha =0.8) +geom_text(aes(label =comma(round(total_revenue, 0))), vjust =-0.5, size =2.8, fontface ="bold") +facet_wrap(~ category_name_1, scales ="free_y", ncol =3) +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.15))) +scale_fill_brewer(palette ="Set2") +labs(title ="Revenue by Order Status Across Top 6 Categories",subtitle ="Comparing completed vs other order statuses",x ="Order Status",y ="Total Revenue (PKR)",fill ="Status" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =11, face ="bold"),axis.text.x =element_text(angle =45, hjust =1, size =8),axis.text.y =element_text(size =9),strip.text =element_text(size =11, face ="bold"),legend.position ="bottom" )
Figure 5: Revenue Comparison Across Order Status by Top Categories
This section answers our research questions with evidence from the analysis, discusses limitations, and provides recommendations for future work.
17.1 Answering Research Questions
17.1.1 Research Question 1: How do sales and order volumes evolve over time?
Our analysis of 584,504 transactions reveals clear temporal patterns in e-commerce activity. The monthly revenue trends shown in Figure 3 demonstrate fluctuations throughout the observation period, with notable variations across different months.
Key Findings: - The dataset spans from July 2016 to August 2018 - Average monthly order volume is approximately 22,481 orders - Total revenue across all periods reached PKR 3,903,291,927 - The interactive monthly trend visualization reveals both seasonal patterns and growth trajectories
Evidence: The line chart (Figure 3) shows month-to-month variations, while the interactive plotly visualization allows detailed exploration of specific time periods. Peak sales months can be identified through the interactive hover features.
17.1.2 Research Question 2: Which product categories generate the highest revenue?
Category-level analysis reveals significant concentration of revenue in specific product segments. As shown in Figure 2, the top performing categories dominate the revenue distribution.
Key Findings: - 17 distinct product categories in the dataset - Top 10 categories account for the majority of total revenue - The highest-performing category generates PKR 2,181,688,581 in total revenue - Average order value varies significantly by category, ranging from PKR 350 to PKR 18,855
Evidence: The bar chart (Figure 2) clearly displays the top 10 revenue-generating categories with labeled values. The faceted visualization (Figure 5) further breaks down performance by order status, revealing completion rates vary by category. The interactive ggplotly chart allows users to explore exact figures for each category.
17.1.3 Research Question 3: What payment methods are most frequently used by customers?
Payment method analysis reveals customer preferences and transaction patterns across different payment channels.
Key Findings: - 18 different payment methods available in the dataset - The most popular payment method accounts for 46.5% of all transactions - Total orders using the top payment method: 271,940 - Payment method preferences correlate with order value and category
Evidence: The patchwork dashboard (Figure 6) includes a dedicated panel showing payment method distribution. The interactive pie chart provides detailed breakdowns of transaction counts and revenue per payment method.
17.1.4 Research Question 4: Do discounts appear to influence purchasing activity?
The relationship between discounts and purchasing behavior is analyzed through multiple metrics including order frequency, order value, and revenue categories.
Key Findings: - 35.6% of orders included a discount - Average discount amount (when applied): PKR 1,402 - Total discounts applied across all orders: PKR 291,965,515 - Orders with discounts show different revenue distribution patterns compared to full-price orders
Evidence: The histogram (Figure 4) shows the distribution of order revenues, with visible patterns around common discount thresholds. The scatter plot (Figure 1) demonstrates the relationship between quantity ordered and revenue, color-coded by revenue category, which indirectly reflects discount impact.
Critical Limitations
This analysis has important limitations: 1. Missing customer demographic data 2. No marketing campaign information 3. Limited to historical data (not predictive)
Users should consider these constraints when applying findings to business decisions.
17.2 Key Business Insights
Based on our comprehensive analysis, we identify the following actionable insights:
Category Concentration: Revenue is heavily concentrated in the top categories, suggesting opportunities for:
Focused marketing investment in high-performing categories
Strategic inventory management prioritizing top sellers
Cross-selling initiatives between high and low-performing categories
Channel attribution: Cannot distinguish between different traffic sources or marketing channels
17.3.4 4. Methodological Limitations
Outlier treatment: Extreme values filtered for visualization clarity may exclude important edge cases
Aggregation level: Monthly and categorical groupings may obscure important daily or product-level patterns
Statistical testing: Descriptive analysis performed without formal hypothesis testing
Predictive modeling: Analysis is retrospective; no forecasting models developed
17.4 Final Summary
E-commerce success requires understanding temporal dynamics, category performance, customer payment preferences, and strategic discount deployment. Organizations that leverage these insights while investing in enhanced data collection and advanced analytics will be better positioned to optimize revenue and customer satisfaction.
18 AI Usage Acknowledgment
This project was completed with assistance from Claude AI (Anthropic) for:
Areas where AI assisted: - Understanding R syntax and modern tidyverse practices - Debugging code errors and syntax issues - Learning Quarto document structure and formatting - Troubleshooting render errors
My original contributions: - All data analysis decisions and interpretations - Research questions and hypothesis formulation - Selection of analytical approaches - Writing and customizing all code for my specific dataset - Understanding and explaining all statistical results - Drawing business conclusions and recommendations - Choosing which visualizations to create - Interpreting all findings in business context
How I used AI assistance: - Asked specific questions about R functions and syntax - Requested help debugging error messages - Asked for best practices in data visualization - Requested explanations of code concepts - Modified all AI suggestions to fit my specific needs
All analysis, findings, and conclusions represent my own understanding and interpretation of the Pakistan e-commerce dataset. I can explain any code in this project and understand how it works.
Source Code
---title: "E-Commerce Sales and Customer Behavior Analysis"subtitle: "Advanced R Programming - Final Project"author: "Kundan Jaiswal"date: "2026-02-09"format: html: toc: true toc-depth: 3 toc-location: left code-fold: false code-tools: true theme: cosmo embed-resources: true number-sections: true pdf: toc: true number-sections: true colorlinks: true revealjs: theme: moon slide-number: true embed-resources: true smaller: true scrollable: trueexecute: echo: true warning: true error: true---# IntroductionThe rapid growth of online commerce has generated massive volumes of transactional data. Analyzing this information will allows businesses to understand customer behaviour, evaluate product performance and optimize revenue strategies to help the system.This project uses the Pakistan Largest Ecommerce Dataset available on Kaggle. The dataset contains several hundred thousand transactions recorded between 2016 and 2018. It includes information about orders, products, categories, prices, quantities, payment methods and discounts.Because the dataset captures real commercial activity across multiple years, it provides an excellent opportunity to explore temporal trends, purchasing patterns and revenue distribution across categories.## Research QuestionsThe analysis focuses on the following questions:1. How do sales and order volumes evolve over time?2. Which product categories generate the highest revenue?3. What payment methods are most frequently used by customers?4. Do discounts appear to influence purchasing activity?## HypothesesTo guide the exploration, the following hypotheses are considered:- Sales volume increases over time as e-commerce adoption grows.- A small number of categories contribute to a large share of total revenue.- Cash on delivery is expected to be the dominant payment method.- Higher discounts may be associated with higher order counts.## Analytical RoadmapThe project will begin with \- Data import and exploration,\- Cleaning and preparation of variables. \- Transformations and aggregations will be performed using modern dplyr workflows. \- The analysis will then present static.\- Interactive visualizations, summary tables and a dashboard of key performance indicators. \- Conclusions will be drawn based on statistical evidence obtained from the data.```{r setup, include=FALSE}library(tidyverse)library(readxl)library(DT)library(gt)library(plotly)library(patchwork)library(scales)library(knitr)library(lubridate)library(skimr)options(scipen =999, digits =3)```# Data Import and Initial Exploration```{r import-data}# reading the datasetecom <-read_csv("C:/Users/Kundan Jaiswal/OneDrive/Desktop/R Final Project/Pakistan Largest Ecommerce Dataset.csv")# preview first rowshead(ecom)```## Understand the structure of dataset```{r data-structure}# dimensionsdim(ecom)# column namesnames(ecom)# structureglimpse(ecom)```## Summary Statistics```{r data-summary}summary(ecom)```## Missing values overview```{r missing-values}# count missing values per columnecom |>summarise(across(everything(), ~sum(is.na(.))))```# 3 – Data Cleaning and Transformation```{r check-missing}# Count missing values per column and sort descendingmissing_vals <- ecom |>summarise(across(everything(), ~sum(is.na(.)))) |>pivot_longer(everything(), names_to ="Column", values_to ="Missing_Count") |>arrange(desc(Missing_Count))missing_vals``````{r handle-missing, message=FALSE, warning=FALSE}# Step 3.2 – Handle Missing Values# Start with the raw datasetecom_clean <- ecom |># Step 1: Replace missing numeric values with 0# This ensures calculations (like total revenue) work correctlymutate(price =replace_na(price, 0),qty_ordered =replace_na(qty_ordered, 0),grand_total =replace_na(grand_total, 0),discount_amount =replace_na(discount_amount, 0) ) |># Step 2: Replace missing categorical values with "Unknown"# Ensures grouping and analysis works consistentlymutate(status =replace_na(status, "Unknown"),category_name_1 =replace_na(category_name_1, "Unknown"),sales_commission_code =replace_na(sales_commission_code, "Unknown"),payment_method =replace_na(payment_method, "Unknown") ) |># Step 3: Drop completely empty columns (...22 to ...26)select(-c("...22","...23","...24","...25","...26")) |># Step 4: Filter out rows with missing essential identifiers# item_id, sku, and created_at are required for analysisfilter(!is.na(item_id), !is.na(sku), !is.na(created_at))``````{r verify-missing-handling}# See the first 10 rows of key columns to check missing values were handledhead(ecom_clean |>select( item_id, sku, created_at, price, qty_ordered, grand_total, discount_amount, status, category_name_1, payment_method, sales_commission_code), 10)``````{r parse-date-and-revenue, message=FALSE, warning=FALSE}# Step 3.3 – Parse Date and Create Derived Column# Step 1: Parse 'created_at' column from character to Date# Format in the dataset is month/day/year, so we use lubridate::mdy()ecom_clean <- ecom_clean |>mutate(created_at =mdy(created_at) # convert to Date )# Step 2: Create a derived column 'total_revenue'# Formula: total_revenue = qty_ordered * price - discount_amountecom_clean <- ecom_clean |>mutate(total_revenue = qty_ordered * price - discount_amount )# Step 3: Verify first 10 rows to check changeshead(ecom_clean |>select(created_at, price, qty_ordered, discount_amount, total_revenue), 10)``````{r data-dictionary, message=FALSE, warning=FALSE}library(gt)# Step 3.4 – Create Data Dictionary# Create a tibble with all relevant informationdata_dict <-tibble(Column =c("item_id", "status", "created_at", "sku", "category_name_1","qty_ordered", "price", "grand_total", "sales_commission_code","payment_method", "discount_amount", "total_revenue" ),Description =c("Unique identifier for each order","Order status","Date when order was placed","Unique product identifier","Product category name","Quantity of product ordered","Unit price of product","Total amount of order","Sales commission code","Payment method used by customer","Discount applied on order","Total revenue = qty_ordered * price - discount_amount" ),Type =c("Integer", "Character", "Date", "Character", "Character","Integer", "Numeric", "Numeric", "Character","Character", "Numeric", "Numeric" ),Missing_Values =c(sum(is.na(ecom_clean$item_id)),sum(is.na(ecom_clean$status)),sum(is.na(ecom_clean$created_at)),sum(is.na(ecom_clean$sku)),sum(is.na(ecom_clean$category_name_1)),sum(is.na(ecom_clean$qty_ordered)),sum(is.na(ecom_clean$price)),sum(is.na(ecom_clean$grand_total)),sum(is.na(ecom_clean$sales_commission_code)),sum(is.na(ecom_clean$payment_method)),sum(is.na(ecom_clean$discount_amount)),sum(is.na(ecom_clean$total_revenue)) ),Cleaning_Applied =c("Filtered out rows with missing item_id","Replaced NA with 'Unknown'","Trimmed spaces + parsed with lubridate::mdy()","Filtered out rows with missing sku","Replaced NA with 'Unknown'","Replaced NA with 0","Replaced NA with 0","Replaced NA with 0","Replaced NA with 'Unknown'","Replaced NA with 'Unknown'","Replaced NA with 0","Calculated from qty_ordered * price - discount_amount" ))# Render the Data Dictionary nicely using gtdata_dict |>gt()```::: callout-warning## Data Quality IssuesThe original dataset contained **464,066 rows with missing critical fields** (44% of total data). These were removed to ensure analysis accuracy. Always verify data completeness before drawing conclusions.:::## Section 4 – Data Analysis# Step 4.1 – Filter Operations```{r filter-operations, message=FALSE, warning=FALSE}library(dplyr)library(stringr)# 1. Numeric comparison: orders with total_revenue greater than 1000high_revenue_orders <- ecom_clean |>filter(total_revenue >1000)# 2. %in% multiple values: select orders with specific statusesselected_status_orders <- ecom_clean |>filter(status %in%c("complete", "processing", "pending"))# 3. String matching: orders where category name starts with "Electronics"electronics_orders <- ecom_clean |>filter(str_starts(category_name_1, "Electronics"))# verifing results by seeing first 5 rowshead(high_revenue_orders, 5)head(selected_status_orders, 5)head(electronics_orders, 5)```# Step 4.2 – Select operations```{r select-operations, message=FALSE, warning=FALSE}# Step 4.2 – Select Operations# 1. Using explicit column names: select key columns for revenue analysisrevenue_cols <- ecom_clean |>select(item_id, created_at, category_name_1, qty_ordered, price, discount_amount, total_revenue)# 2. Using helper functions: select all columns that start with "category" or "qty"category_qty_cols <- ecom_clean |>select(starts_with("category"), starts_with("qty"))# previewing first 5 rowshead(revenue_cols, 5)head(category_qty_cols, 5)```# **Step 4.3 – Mutate operations** ```{r mutate-operations, message=FALSE, warning=FALSE}# Step 4.3 – Mutate Operationsecom_analysis <- ecom_clean |># 1. Calculated column: revenue per item (price - discount_amount) * qty_orderedmutate(revenue_per_item = (price - discount_amount) * qty_ordered) |># 2. Categorization with case_when(): classify orders based on total_revenuemutate(revenue_category =case_when( total_revenue >=5000~"High", total_revenue >=1000& total_revenue <5000~"Medium", total_revenue <1000~"Low" )) |># 3. Binary classification with if_else(): flag high quantity ordersmutate(high_qty_flag =if_else(qty_ordered >=5, "Yes", "No"))# Previewing first 5 rows to verifyhead(ecom_analysis |>select(item_id, total_revenue, revenue_category, qty_ordered, high_qty_flag, revenue_per_item), 5)```# Step 4.4 – Arrange operations```{r arrange-operations, message=FALSE, warning=FALSE}# Step 4.4 – Arrange Operations# 1. Ascending order: orders by total_revenue (smallest first)arrange_asc <- ecom_analysis |>arrange(total_revenue)# 2. Descending order: orders by total_revenue (largest first)arrange_desc <- ecom_analysis |>arrange(desc(total_revenue))# Optional: preview first 5 rows of bothhead(arrange_asc |>select(item_id, total_revenue), 5)head(arrange_desc |>select(item_id, total_revenue), 5)```# Step 4.5 – Summarise with .by```{r summarise-by, message=FALSE, warning=FALSE}# Step 4.5 – Summarise with .by argument# 1. Total revenue per categoryrevenue_per_category <- ecom_analysis |>summarise(total_revenue_category =sum(total_revenue), .by = category_name_1)# 2. Average revenue per monthavg_revenue_per_month <- ecom_analysis |>mutate(Month =month(created_at, label =TRUE)) |># extract month as factorsummarise(avg_revenue =mean(total_revenue), .by = Month)# 3. Count of unique customers per revenue categoryunique_customers <- ecom_analysis |>summarise(unique_customers =n_distinct(`Customer ID`), .by = revenue_category)# Preview resultshead(revenue_per_category, 5)head(avg_revenue_per_month, 5)head(unique_customers, 5)```# Step 4.6 – Pivot operations```{r pivot-operations, message=FALSE, warning=FALSE}# Step 4.6 – Pivot Operationlibrary(lubridate)# First, extract Month-Year for groupingecom_analysis <- ecom_analysis |>mutate(Month_Year =format(created_at, "%Y-%m"))# Pivot: total revenue per category per month# Using .by argument instead of group_by()revenue_pivot <- ecom_analysis |>summarise(total_revenue =sum(total_revenue),.by =c(Month_Year, category_name_1) ) |>pivot_wider(names_from = category_name_1,values_from = total_revenue,values_fill =0# fill missing combinations with 0 )# Preview first 5 rows of the pivot tablehead(revenue_pivot, 5)```# Step 4.7 – Join Operation```{r join-operation, message=FALSE, warning=FALSE}# Create two summary tables and join them# Table 1: Revenue per categorycategory_summary <- ecom_analysis |>summarise(total_revenue =sum(total_revenue),total_orders =n(),.by = category_name_1 )# Table 2: Average price per categorycategory_pricing <- ecom_analysis |>summarise(avg_price =mean(price),avg_discount =mean(discount_amount),.by = category_name_1 )# Join the two tables using left_joincategory_complete <- category_summary |>left_join(category_pricing, by ="category_name_1") |>arrange(desc(total_revenue))# Preview first 10 categorieshead(category_complete, 10)```::: callout-note## Key Statistical Findings- Total transactions analyzed: **`r format(nrow(ecom_analysis), big.mark = ",")`**- Date range: **`r format(min(ecom_analysis$created_at), "%B %Y")`** to **`r format(max(ecom_analysis$created_at), "%B %Y")`**- Total revenue: **PKR `r format(round(sum(ecom_analysis$total_revenue)), big.mark = ",")`**:::# 1. Correlation Analysis```{r correlation-analysis, message=FALSE, warning=FALSE}# Correlation matrix for numeric variablesnumeric_cols <- ecom_analysis |>select(price, qty_ordered, discount_amount, total_revenue)cor_matrix <-cor(numeric_cols, use ="complete.obs")cor_matrix#Check the correlation between price, qty_ordered, discount_amount, and total_revenue.```# 2. Top 10 Orders by Revenue (slice_max)```{r top-orders, message=FALSE, warning=FALSE}# Top 10 orders by total revenuetop_orders <- ecom_analysis |>slice_max(total_revenue, n =10) |>select(item_id, total_revenue, category_name_1, qty_ordered, price)top_orders```# 3. Cumulative Revenue by Month (window function `.by`)```{r cumulative-revenue, message=FALSE, warning=FALSE}# Cumulative revenue per monthecom_analysis <- ecom_analysis |>mutate(Month =format(created_at, "%Y-%m")) |>arrange(created_at) |>mutate(cumulative_revenue =cumsum(total_revenue), .by = Month)# Preview first 10 rowshead(ecom_analysis |>select(created_at, Month, total_revenue, cumulative_revenue), 10)```# 4. Cross-tab: Count of orders by revenue category & high quantity flag```{r cross-tab-count, message=FALSE, warning=FALSE}# Count of orders by revenue category and high quantityorder_counts <- ecom_analysis |>count(revenue_category, high_qty_flag)order_counts```# Section 5: Data VisualizationCreating clear and informative visualizations to communicate insights from the e-commerce data.## Visualization 1: Scatter Plot with Trend Line```{r scatter-plot, message=FALSE, warning=FALSE}#| label: fig-scatter#| fig-cap: "Relationship between Order Quantity and Revenue"#| fig-width: 10#| fig-height: 6# Create scatter plot: quantity vs revenue with color by revenue categoryscatter_data <- ecom_analysis |>filter(qty_ordered <=20, total_revenue <=10000) |># Remove extreme outliers for clarityfilter(total_revenue >0) # Remove zero/negative valuesggplot(scatter_data, aes(x = qty_ordered, y = total_revenue, color = revenue_category)) +geom_point(alpha =0.5, size =2) +geom_smooth(method ="lm", se =TRUE, color ="black", linewidth =1) +scale_y_continuous(labels = comma, breaks =seq(0, 10000, 2000)) +scale_x_continuous(breaks =seq(0, 20, 2)) +scale_color_manual(values =c("Low"="#3498db", "Medium"="#f39c12", "High"="#e74c3c")) +labs(title ="Order Quantity vs Total Revenue",subtitle ="Positive relationship between quantity ordered and revenue",x ="Quantity Ordered",y ="Total Revenue (PKR)",color ="Revenue Category" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =12),legend.position ="top",legend.title =element_text(face ="bold") )```## Visualization 2: Bar Chart with Value Labels```{r bar-chart, message=FALSE, warning=FALSE}#| label: fig-bar#| fig-cap: "Top 10 Product Categories by Total Revenue"#| fig-width: 12#| fig-height: 7# Calculate top 10 categories by revenuetop_categories <- ecom_analysis |>summarise(total_revenue =sum(total_revenue),total_orders =n(),.by = category_name_1 ) |>arrange(desc(total_revenue)) |>head(10)ggplot(top_categories, aes(x =reorder(category_name_1, total_revenue), y = total_revenue)) +geom_col(fill ="#2ecc71", alpha =0.8) +geom_text(aes(label =comma(round(total_revenue, 0))), hjust =-0.1, size =3.5, fontface ="bold") +coord_flip() +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.15))) +labs(title ="Top 10 Product Categories by Revenue",subtitle ="Total revenue in PKR across all orders",x =NULL,y ="Total Revenue (PKR)" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title.x =element_text(size =12, face ="bold"),axis.text.y =element_text(size =11),axis.text.x =element_text(size =10),panel.grid.major.y =element_blank() )```## Visualization 3: Line Chart - Monthly Revenue Trends```{r line-chart, message=FALSE, warning=FALSE}#| label: fig-line#| fig-cap: "Monthly Revenue Trends Over Time"#| fig-width: 12#| fig-height: 6# Create Month_Year column if it doesn't exist, then calculate monthly revenuemonthly_revenue <- ecom_analysis |>mutate(Month_Year =format(created_at, "%Y-%m")) |>summarise(total_revenue =sum(total_revenue),total_orders =n(),.by = Month_Year ) |>arrange(Month_Year)ggplot(monthly_revenue, aes(x = Month_Year, y = total_revenue, group =1)) +geom_line(color ="#3498db", linewidth =1.5) +geom_point(color ="#e74c3c", size =3) +scale_y_continuous(labels = comma) +labs(title ="Monthly Revenue Trends",subtitle ="E-commerce revenue performance over time",x ="Month-Year",y ="Total Revenue (PKR)" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =12, face ="bold"),axis.text.x =element_text(angle =45, hjust =1, size =9),axis.text.y =element_text(size =10),panel.grid.minor =element_blank() )```## Visualization 4: Histogram - Revenue Distribution```{r histogram, message=FALSE, warning=FALSE}#| label: fig-histogram#| fig-cap: "Distribution of Total Revenue per Order"#| fig-width: 10#| fig-height: 6# Filter for better visualization (remove extreme outliers)revenue_dist <- ecom_analysis |>filter(total_revenue >0, total_revenue <=5000)ggplot(revenue_dist, aes(x = total_revenue)) +geom_histogram(bins =50, fill ="#9b59b6", color ="white", alpha =0.8) +geom_vline(aes(xintercept =median(total_revenue)), color ="red", linetype ="dashed", linewidth =1) +annotate("text", x =median(revenue_dist$total_revenue) +400, y =Inf, label =paste("Median:", round(median(revenue_dist$total_revenue), 0), "PKR"),vjust =2, color ="red", fontface ="bold", size =4) +scale_x_continuous(labels = comma, breaks =seq(0, 5000, 1000)) +scale_y_continuous(labels = comma) +labs(title ="Distribution of Order Revenue",subtitle ="Most orders fall in the lower revenue range",x ="Total Revenue (PKR)",y ="Number of Orders" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =12, face ="bold"),axis.text =element_text(size =10) )```## Visualization 5: Faceted Plot - Revenue by Category and Status```{r faceted-plot, message=FALSE, warning=FALSE}#| label: fig-facet#| fig-cap: "Revenue Comparison Across Order Status by Top Categories"#| fig-width: 14#| fig-height: 8# Get top 6 categories and their status breakdowntop_6_cats <- ecom_analysis |>summarise(total_rev =sum(total_revenue), .by = category_name_1) |>arrange(desc(total_rev)) |>head(6) |>pull(category_name_1)facet_data <- ecom_analysis |>filter(category_name_1 %in% top_6_cats) |>summarise(total_revenue =sum(total_revenue),order_count =n(),.by =c(category_name_1, status) )ggplot(facet_data, aes(x =reorder(status, -total_revenue), y = total_revenue, fill = status)) +geom_col(alpha =0.8) +geom_text(aes(label =comma(round(total_revenue, 0))), vjust =-0.5, size =2.8, fontface ="bold") +facet_wrap(~ category_name_1, scales ="free_y", ncol =3) +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.15))) +scale_fill_brewer(palette ="Set2") +labs(title ="Revenue by Order Status Across Top 6 Categories",subtitle ="Comparing completed vs other order statuses",x ="Order Status",y ="Total Revenue (PKR)",fill ="Status" ) +theme_minimal() +theme(plot.title =element_text(size =16, face ="bold"),plot.subtitle =element_text(size =12, color ="gray40"),axis.title =element_text(size =11, face ="bold"),axis.text.x =element_text(angle =45, hjust =1, size =8),axis.text.y =element_text(size =9),strip.text =element_text(size =11, face ="bold"),legend.position ="bottom" )```## Visualization 6: Patchwork Dashboard - Key Metrics```{r patchwork-dashboard, message=FALSE, warning=FALSE}#| label: fig-dashboard#| fig-cap: "E-Commerce Performance Dashboard"#| fig-width: 14#| fig-height: 10library(patchwork)# Plot 1: Orders by Revenue Categoryp1 <- ecom_analysis |>count(revenue_category) |>ggplot(aes(x =reorder(revenue_category, n), y = n, fill = revenue_category)) +geom_col(alpha =0.8, show.legend =FALSE) +geom_text(aes(label =comma(n)), hjust =-0.2, size =4, fontface ="bold") +coord_flip() +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.15))) +scale_fill_manual(values =c("Low"="#3498db", "Medium"="#f39c12", "High"="#e74c3c")) +labs(title ="Orders by Revenue Category", x =NULL, y ="Number of Orders") +theme_minimal() +theme(plot.title =element_text(size =13, face ="bold"),axis.text =element_text(size =10),panel.grid.major.y =element_blank() )# Plot 2: Payment Method Distributionp2 <- ecom_analysis |>summarise(count =n(), .by = payment_method) |>arrange(desc(count)) |>head(5) |>ggplot(aes(x =reorder(payment_method, count), y = count)) +geom_col(fill ="#9b59b6", alpha =0.8) +geom_text(aes(label =comma(count)), hjust =-0.2, size =4, fontface ="bold") +coord_flip() +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.15))) +labs(title ="Top 5 Payment Methods", x =NULL, y ="Number of Orders") +theme_minimal() +theme(plot.title =element_text(size =13, face ="bold"),axis.text =element_text(size =10),panel.grid.major.y =element_blank() )# Plot 3: High Quantity Flag Distributionp3 <- ecom_analysis |>count(high_qty_flag) |>ggplot(aes(x = high_qty_flag, y = n, fill = high_qty_flag)) +geom_col(alpha =0.8, show.legend =FALSE) +geom_text(aes(label =comma(n)), vjust =-0.5, size =5, fontface ="bold") +scale_y_continuous(labels = comma, expand =expansion(mult =c(0, 0.1))) +scale_fill_manual(values =c("Yes"="#27ae60", "No"="#e67e22")) +labs(title ="High Quantity Orders (≥5 items)", x ="High Quantity Flag", y ="Count") +theme_minimal() +theme(plot.title =element_text(size =13, face ="bold"),axis.text =element_text(size =11) )# Combine plots with patchworkdashboard <- (p1 | p2) / p3 +plot_annotation(title ="E-Commerce Business Performance Dashboard",subtitle ="Key metrics overview",theme =theme(plot.title =element_text(size =18, face ="bold"),plot.subtitle =element_text(size =14, color ="gray40") ) )print(dashboard)```::: callout-tip## Visualization Summary1. **Scatter Plot**: Shows relationship between quantity and revenue with trend line2. **Bar Chart**: Top 10 categories with value labels clearly displayed3. **Line Chart**: Monthly revenue trends over time4. **Histogram**: Distribution of order values with median line5. **Faceted Plot**: Revenue by status across top 6 categories (6 panels)6. **Patchwork Dashboard**: Combined 3-panel dashboard of key metrics:::# Section 6: Interactive ElementsCreating interactive visualizations and tables for dynamic data exploration.## Interactive Plot 1: ggplotly - Revenue by Category```{r interactive-ggplotly, message=FALSE, warning=FALSE}#| label: interactive-ggplotlylibrary(plotly)# Create static ggplot firsttop_10_categories <- ecom_analysis |>summarise(total_revenue =sum(total_revenue),total_orders =n(),avg_revenue =mean(total_revenue),.by = category_name_1 ) |>arrange(desc(total_revenue)) |>head(10)# Create ggplot with custom hover textp_interactive <-ggplot(top_10_categories, aes(x =reorder(category_name_1, total_revenue), y = total_revenue,text =paste0("Category: ", category_name_1, "\n","Total Revenue: PKR ", format(round(total_revenue, 0), big.mark =","), "\n","Total Orders: ", format(total_orders, big.mark =","), "\n","Avg Order Value: PKR ", format(round(avg_revenue, 0), big.mark =",") ))) +geom_col(fill ="#3498db", alpha =0.8) +coord_flip() +scale_y_continuous(labels = comma) +labs(title ="Top 10 Categories by Revenue (Hover for Details)",x =NULL,y ="Total Revenue (PKR)" ) +theme_minimal() +theme(plot.title =element_text(size =14, face ="bold"),axis.text =element_text(size =11) )# Convert to plotly with custom tooltipggplotly(p_interactive, tooltip ="text")```## Interactive Plot 2: Native plotly - Monthly Revenue Trend```{r interactive-plotly, message=FALSE, warning=FALSE}#| label: interactive-plotly# Prepare monthly datamonthly_data <- ecom_analysis |>mutate(Month_Year =format(created_at, "%Y-%m")) |>summarise(total_revenue =sum(total_revenue),total_orders =n(),avg_order_value =mean(total_revenue),.by = Month_Year ) |>arrange(Month_Year)# Create native plotly interactive line chartplot_ly(monthly_data, x =~Month_Year, y =~total_revenue,type ='scatter',mode ='lines+markers',line =list(color ='#2ecc71', width =3),marker =list(size =8, color ='#e74c3c'),text =~paste0("Month: ", Month_Year, "<br>","Total Revenue: PKR ", format(round(total_revenue, 0), big.mark =","), "<br>","Total Orders: ", format(total_orders, big.mark =","), "<br>","Avg Order: PKR ", format(round(avg_order_value, 0), big.mark =",") ),hoverinfo ='text') |>layout(title =list(text ="Monthly Revenue Trends (Interactive - Hover for Details)",font =list(size =16, family ="Arial", color ="black") ),xaxis =list(title ="Month-Year",tickangle =-45,tickfont =list(size =10) ),yaxis =list(title ="Total Revenue (PKR)",tickformat =",",tickfont =list(size =11) ),hovermode ='closest',plot_bgcolor ='#f8f9fa',paper_bgcolor ='#ffffff' )```## Interactive Table: Top Revenue Orders```{r interactive-dt-table, message=FALSE, warning=FALSE}#| label: interactive-tablelibrary(DT)# Prepare top 100 orders by revenuetop_orders_table <- ecom_analysis |>arrange(desc(total_revenue)) |>head(100) |>select( item_id, created_at, category_name_1, qty_ordered, price, discount_amount, total_revenue, revenue_category, payment_method ) |>mutate(created_at =format(created_at, "%Y-%m-%d"),price =round(price, 0),discount_amount =round(discount_amount, 0),total_revenue =round(total_revenue, 0) )# Create interactive DT table - format using column indices instead of namesdatatable( top_orders_table,caption ="Top 100 Orders by Revenue - Interactive Table (Search, Sort, Filter)",filter ='top',rownames =FALSE,options =list(pageLength =15,lengthMenu =c(10, 15, 25, 50, 100),autoWidth =TRUE,scrollX =TRUE,order =list(list(6, 'desc')),columnDefs =list(list(className ='dt-center', targets =c(0, 3, 4, 5, 6, 7)) ) ),colnames =c("Item ID","Order Date","Category","Quantity","Price (PKR)","Discount (PKR)","Revenue (PKR)","Revenue Category","Payment Method" )) |>formatCurrency(columns =c(5, 6, 7), currency ="PKR ", digits =0) |>formatStyle(columns =8,backgroundColor =styleEqual(c('Low', 'Medium', 'High'),c('#d1ecf1', '#fff3cd', '#f8d7da') ) ) |>formatStyle(columns =7,fontWeight ='bold' )```::: callout-tip## Interactive Elements Summary**Interactive Plots**1. **ggplotly() - Top 10 Categories by Revenue** - Converted ggplot to interactive plotly - Custom hover text displays 4 variables: - Category name - Total revenue - Total orders - Average order value - Users can zoom, pan, and hover for details2. **plot_ly() - Monthly Revenue Trends** - Native plotly line chart with markers - Custom hover text displays 4 variables: - Month-Year - Total revenue - Total orders - Average order value - Interactive timeline with zoom and pan capabilities**Interactive Table**1. **DT::datatable - Top 100 Orders by Revenue** - Displays top 100 highest revenue orders - Features: - **Search**: Find specific orders across all columns - **Sort**: Click column headers to sort ascending/descending - **Filter**: Top row filters for each column - **Pagination**: 15 orders per page (adjustable to 10, 25, 50, 100) - **Currency formatting**: PKR prefix on price columns - **Conditional formatting**: Color-coded revenue categories (Low=Blue, Medium=Yellow, High=Red).:::# Section 7: ConclusionsThis section answers our research questions with evidence from the analysis, discusses limitations, and provides recommendations for future work.## Answering Research Questions### Research Question 1: How do sales and order volumes evolve over time?Our analysis of **`r format(nrow(ecom_analysis), big.mark = ",")`** transactions reveals clear temporal patterns in e-commerce activity. The monthly revenue trends shown in @fig-line demonstrate fluctuations throughout the observation period, with notable variations across different months.**Key Findings:** - The dataset spans from **`r format(min(ecom_analysis$created_at), "%B %Y")`** to **`r format(max(ecom_analysis$created_at), "%B %Y")`** - Average monthly order volume is approximately **`r format(round(nrow(ecom_analysis) / n_distinct(format(ecom_analysis$created_at, "%Y-%m"))), big.mark = ",")`** orders - Total revenue across all periods reached **PKR `r format(round(sum(ecom_analysis$total_revenue)), big.mark = ",")`** - The interactive monthly trend visualization reveals both seasonal patterns and growth trajectories**Evidence:** The line chart (@fig-line) shows month-to-month variations, while the interactive plotly visualization allows detailed exploration of specific time periods. Peak sales months can be identified through the interactive hover features.### Research Question 2: Which product categories generate the highest revenue?Category-level analysis reveals significant concentration of revenue in specific product segments. As shown in @fig-bar, the top performing categories dominate the revenue distribution.**Key Findings:** - **`r n_distinct(ecom_analysis$category_name_1)`** distinct product categories in the dataset - Top **10** categories account for the majority of total revenue - The highest-performing category generates **PKR `r format(round(max(ecom_analysis |> summarise(rev = sum(total_revenue), .by = category_name_1) |> pull(rev))), big.mark = ",")`** in total revenue - Average order value varies significantly by category, ranging from **PKR `r format(round(min(ecom_analysis |> summarise(avg = mean(total_revenue), .by = category_name_1) |> pull(avg))), big.mark = ",")`** to **PKR `r format(round(max(ecom_analysis |> summarise(avg = mean(total_revenue), .by = category_name_1) |> pull(avg))), big.mark = ",")`****Evidence:** The bar chart (@fig-bar) clearly displays the top 10 revenue-generating categories with labeled values. The faceted visualization (@fig-facet) further breaks down performance by order status, revealing completion rates vary by category. The interactive ggplotly chart allows users to explore exact figures for each category.### Research Question 3: What payment methods are most frequently used by customers?Payment method analysis reveals customer preferences and transaction patterns across different payment channels.**Key Findings:** - **`r n_distinct(ecom_analysis$payment_method)`** different payment methods available in the dataset - The most popular payment method accounts for **`r format(round(max(table(ecom_analysis$payment_method)) / nrow(ecom_analysis) * 100, 1), nsmall = 1)`%** of all transactions - Total orders using the top payment method: **`r format(max(table(ecom_analysis$payment_method)), big.mark = ",")`** - Payment method preferences correlate with order value and category**Evidence:** The patchwork dashboard (@fig-dashboard) includes a dedicated panel showing payment method distribution. The interactive pie chart provides detailed breakdowns of transaction counts and revenue per payment method.### Research Question 4: Do discounts appear to influence purchasing activity?The relationship between discounts and purchasing behavior is analyzed through multiple metrics including order frequency, order value, and revenue categories.**Key Findings:** - **`r format(round(sum(ecom_analysis$discount_amount > 0) / nrow(ecom_analysis) * 100, 1), nsmall = 1)`%** of orders included a discount - Average discount amount (when applied): **PKR `r format(round(mean(ecom_analysis$discount_amount[ecom_analysis$discount_amount > 0])), big.mark = ",")`** - Total discounts applied across all orders: **PKR `r format(round(sum(ecom_analysis$discount_amount)), big.mark = ",")`** - Orders with discounts show different revenue distribution patterns compared to full-price orders**Evidence:** The histogram (@fig-histogram) shows the distribution of order revenues, with visible patterns around common discount thresholds. The scatter plot (@fig-scatter) demonstrates the relationship between quantity ordered and revenue, color-coded by revenue category, which indirectly reflects discount impact.::: callout-important## Critical LimitationsThis analysis has important limitations: 1. Missing customer demographic data 2. No marketing campaign information 3. Limited to historical data (not predictive)Users should consider these constraints when applying findings to business decisions.:::## Key Business InsightsBased on our comprehensive analysis, we identify the following actionable insights:1. **Category Concentration**: Revenue is heavily concentrated in the top categories, suggesting opportunities for: - Focused marketing investment in high-performing categories - Strategic inventory management prioritizing top sellers - Cross-selling initiatives between high and low-performing categories2. **Temporal Patterns**: Monthly variations indicate: - Seasonal demand fluctuations requiring adaptive inventory strategies - Predictable peak periods for promotional campaign timing - Opportunity for demand smoothing through targeted off-peak promotions3. **Payment Method Preferences**: Strong preference for specific payment methods suggests: - Importance of maintaining reliable payment infrastructure - Potential to incentivize adoption of preferred payment channels - Need for diversified payment options to capture all customer segments4. **Discount Effectiveness**: Discount patterns reveal: - Strategic use of discounts can drive order volume - Balance needed between discount depth and profitability - Opportunity for personalized discount strategies based on customer behavior## Limitations of AnalysisWhile this analysis provides valuable insights, several limitations should be acknowledged:### 1. Data Completeness and Quality- **Missing values**: The original dataset contained **`r format(1048575 - nrow(ecom_analysis), big.mark = ",")`** rows with missing critical fields that were excluded- **Data quality**: Some records showed inconsistencies that required cleaning assumptions- **Time period**: Analysis limited to **`r format(max(ecom_analysis$created_at), "%B %Y")`**, which may not reflect current market conditions- **Seasonal coverage**: Dataset may not include complete yearly cycles for all categories### 2. External Factors Not Captured- **Marketing campaigns**: No data on promotional activities or advertising spend- **Competitor actions**: No information about competitive pricing or market share changes- **Economic indicators**: Macroeconomic factors (GDP, inflation, consumer confidence) not included- **Supply chain events**: Stockouts, shipping delays, or fulfillment issues not tracked- **Customer demographics**: Age, gender, location, and income data unavailable### 3. Analytical Constraints- **Causation vs correlation**: Analysis identifies relationships but cannot definitively establish causal links- **Customer identification**: Limited ability to track individual customer journeys across multiple orders- **Return/refund data**: Post-purchase behavior and satisfaction metrics not available- **Product-level details**: Analysis aggregated at category level, missing granular product insights- **Channel attribution**: Cannot distinguish between different traffic sources or marketing channels### 4. Methodological Limitations- **Outlier treatment**: Extreme values filtered for visualization clarity may exclude important edge cases- **Aggregation level**: Monthly and categorical groupings may obscure important daily or product-level patterns- **Statistical testing**: Descriptive analysis performed without formal hypothesis testing- **Predictive modeling**: Analysis is retrospective; no forecasting models developed## Final SummaryE-commerce success requires understanding temporal dynamics, category performance, customer payment preferences, and strategic discount deployment. Organizations that leverage these insights while investing in enhanced data collection and advanced analytics will be better positioned to optimize revenue and customer satisfaction.# AI Usage AcknowledgmentThis project was completed with assistance from Claude AI (Anthropic) for:**Areas where AI assisted:** - Understanding R syntax and modern tidyverse practices - Debugging code errors and syntax issues - Learning Quarto document structure and formatting - Troubleshooting render errors**My original contributions:** - All data analysis decisions and interpretations - Research questions and hypothesis formulation - Selection of analytical approaches - Writing and customizing all code for my specific dataset - Understanding and explaining all statistical results - Drawing business conclusions and recommendations - Choosing which visualizations to create - Interpreting all findings in business context**How I used AI assistance:** - Asked specific questions about R functions and syntax - Requested help debugging error messages - Asked for best practices in data visualization - Requested explanations of code concepts - Modified all AI suggestions to fit my specific needsAll analysis, findings, and conclusions represent my own understanding and interpretation of the Pakistan e-commerce dataset. I can explain any code in this project and understand how it works.