Data Analysis

Olist Brazillian E-Commerce: Data Analysis

Summary
Analysis of the Olist E-Commerce dataset to uncover sales trends and customer-seller distribution using Python.
Timeline
Dec 2025
Tools
#python #pandas #matplotlib

Overview

Brazilian E-Commerce Public Dataset by Olist contains real anonymized commercial data of orders made at the Olist Store. The dataset includes information on around 100k orders from 2016 to 2018, made across multiple marketplaces in Brazil. It consist of

It consists of 9 CSV files with the following structure:

  1. olist_customers_dataset.csv
  2. olist_geolocation_dataset.csv
  3. olist_order_items_dataset.csv
  4. olist_order_payments_dataset.csv
  5. olist_order_reviews_dataset.csv
  6. olist_orders_dataset.csv
  7. olist_products_dataset.csv
  8. olist_sellers_dataset.csv
  9. product_category_name_translation.csv

Goals / Objectives

The goals of this analysis is to answer these business questions:

  1. Sales Trend / Time Series Analysis
    • How do order counts change over time?
    • How much revenue is generated (including and excluding freight)?
    • What is the Average Order Value (AOV) over time?
  2. Customer and Seller Segmentation
    • How are customers and sellers distributed across different regions?
  3. Sales Prediction
    • How many orders are expected in the next quarter?
    • How much revenue is expected to be generated in the next quarter?

Methods & Tools

Data Preparation

Each dataset was loaded into a separate DataFrame. However, only the datasets required to answer the business questions were preprocessed. Therefore, the data preparation process focused on the following four tables:

  1. Customers
  2. Sellers
  3. Orders
  4. Order Items

For each table, the following preprocessing steps were performed:

  1. Checking for missing values using isnull().sum()
orders_df.isnull().sum()
  1. Checking for duplicate records using duplicated().sum()
orders_df.duplicated().sum()
  1. Reviewing data types with info()
orders_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB
  1. Converting data types when necessary to ensure they were suitable for analysis
orders_df['order_purchase_timestamp'] = pd.to_datetime(orders_df['order_purchase_timestamp'], errors="coerce")
orders_df['order_approved_at'] = pd.to_datetime(orders_df['order_approved_at'], errors="coerce")
orders_df['order_delivered_carrier_date'] = pd.to_datetime(orders_df['order_delivered_carrier_date'], errors="coerce")
orders_df['order_delivered_customer_date'] = pd.to_datetime(orders_df['order_delivered_customer_date'], errors="coerce")
orders_df['order_estimated_delivery_date'] = pd.to_datetime(orders_df['order_estimated_delivery_date'], errors="coerce")

orders_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       99441 non-null  object        
 1   customer_id                    99441 non-null  object        
 2   order_status                   99441 non-null  object        
 3   order_purchase_timestamp       99441 non-null  datetime64[ns]
 4   order_approved_at              99281 non-null  datetime64[ns]
 5   order_delivered_carrier_date   97658 non-null  datetime64[ns]
 6   order_delivered_customer_date  96476 non-null  datetime64[ns]
 7   order_estimated_delivery_date  99441 non-null  datetime64[ns]
dtypes: datetime64[ns](5), object(3)
memory usage: 6.1+ MB

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5
Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Block quote

Ordered list

  1. Item 1
  2. Item 2
  3. Item 3

Unordered list

Text link

Bold text

Emphasis

Superscript

Subscript

ini caption
1import tensorflow
2
3print("this is super super super super super super super super super super long text code")