Overview
Brazilian E-Commerce Public Dataset by Olist contains real anonymized commercial data of orders made at the Olist Store. The dataset includes information on around 100k orders from 2016 to 2018, made across multiple marketplaces in Brazil. It consist of
It consists of 9 CSV files with the following structure:
- olist_customers_dataset.csv
- olist_geolocation_dataset.csv
- olist_order_items_dataset.csv
- olist_order_payments_dataset.csv
- olist_order_reviews_dataset.csv
- olist_orders_dataset.csv
- olist_products_dataset.csv
- olist_sellers_dataset.csv
- product_category_name_translation.csv

Goals / Objectives
The goals of this analysis is to answer these business questions:
- Sales Trend / Time Series Analysis
- How do order counts change over time?
- How much revenue is generated (including and excluding freight)?
- What is the Average Order Value (AOV) over time?
- Customer and Seller Segmentation
- How are customers and sellers distributed across different regions?
- Sales Prediction
- How many orders are expected in the next quarter?
- How much revenue is expected to be generated in the next quarter?
Methods & Tools
Data Preparation
Each dataset was loaded into a separate DataFrame. However, only the datasets required to answer the business questions were preprocessed. Therefore, the data preparation process focused on the following four tables:
- Customers
- Sellers
- Orders
- Order Items
For each table, the following preprocessing steps were performed:
- Checking for missing values using isnull().sum()
orders_df.isnull().sum()- Checking for duplicate records using duplicated().sum()
orders_df.duplicated().sum()- Reviewing data types with info()
orders_df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 99441 non-null object
1 customer_id 99441 non-null object
2 order_status 99441 non-null object
3 order_purchase_timestamp 99441 non-null object
4 order_approved_at 99281 non-null object
5 order_delivered_carrier_date 97658 non-null object
6 order_delivered_customer_date 96476 non-null object
7 order_estimated_delivery_date 99441 non-null object
dtypes: object(8)
memory usage: 6.1+ MB- Converting data types when necessary to ensure they were suitable for analysis
orders_df['order_purchase_timestamp'] = pd.to_datetime(orders_df['order_purchase_timestamp'], errors="coerce")
orders_df['order_approved_at'] = pd.to_datetime(orders_df['order_approved_at'], errors="coerce")
orders_df['order_delivered_carrier_date'] = pd.to_datetime(orders_df['order_delivered_carrier_date'], errors="coerce")
orders_df['order_delivered_customer_date'] = pd.to_datetime(orders_df['order_delivered_customer_date'], errors="coerce")
orders_df['order_estimated_delivery_date'] = pd.to_datetime(orders_df['order_estimated_delivery_date'], errors="coerce")
orders_df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 99441 non-null object
1 customer_id 99441 non-null object
2 order_status 99441 non-null object
3 order_purchase_timestamp 99441 non-null datetime64[ns]
4 order_approved_at 99281 non-null datetime64[ns]
5 order_delivered_carrier_date 97658 non-null datetime64[ns]
6 order_delivered_customer_date 96476 non-null datetime64[ns]
7 order_estimated_delivery_date 99441 non-null datetime64[ns]
dtypes: datetime64[ns](5), object(3)
memory usage: 6.1+ MB