Exploratory Data Analysis on Water Potability

Focused on data cleaning, trend analysis, and storytelling with visualizations, highlighting critical patterns in water conditions to support public health decision-making.

Tech Stack: Python, Pandas, NumPy, Matplotlib, and Seaborn

Overview

Problem

The dataset contained water quality parameters collected over a decade, and the objective was to understand which factors influence whether water is potable. The main challenges were missing values, noisy measurements, and uneven distributions in parameters such as pH, hardness, solids, chloramines, and sulfate. Extracting meaningful patterns required both rigorous preprocessing and careful visualization choices to avoid misleading interpretations.

Approach

  • Data Cleaning & Preprocessing
    • Addressed missing values using statistical imputation (mean/median) where appropriate.
    • Normalized features like total dissolved solids to ensure fair comparison across different scales.
    • Detected outliers with boxplot-based methods and capped their influence through winsorization.
  • Exploratory Data Analysis (EDA)
    • Used correlation heatmaps, pair plots, and distribution plots (via Seaborn & Matplotlib) to identify interdependencies among features.
    • Compared potable vs non-potable samples to discover key differences in pH balance, hardness levels, and chloramine concentrations.
    • Conducted time-series trend analysis to examine changes in water quality over the decade.
  • Design Choices
    • Focused on interpretable visualizations to make the findings useful for both technical and non-technical stakeholders.
    • Prioritized statistical validity (e.g., using log-transforms for skewed data) over purely aesthetic graphs.
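
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the 1%/99% winsorization limits, the choice of median over mean, and the synthetic stand-in data are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Assumed column names; the project's actual schema may differ.
FEATURES = ["ph", "Hardness", "Solids", "Chloramines", "Sulfate"]


def clean(df, lower=0.01, upper=0.99):
    """Median-impute, winsorize, and min-max scale each feature column."""
    out = df.copy()
    for col in FEATURES:
        s = out[col]
        # Median imputation is less sensitive to outliers than the mean.
        s = s.fillna(s.median())
        # Winsorize: cap values outside the chosen quantiles (assumed 1%/99%).
        low, high = s.quantile(lower), s.quantile(upper)
        s = s.clip(lower=low, upper=high)
        # Min-max scale to [0, 1] so features share a common range.
        span = s.max() - s.min()
        out[col] = (s - s.min()) / span if span > 0 else 0.0
    return out


# Synthetic stand-in data, generated only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "Hardness": rng.normal(200.0, 30.0, n),
    "Solids": rng.lognormal(10.0, 0.4, n),
    "Chloramines": rng.normal(7.0, 1.5, n),
    "Sulfate": rng.normal(330.0, 40.0, n),
    "Potability": rng.integers(0, 2, n),
})
raw.loc[:19, "ph"] = np.nan  # inject 20 missing values

cleaned = clean(raw)
# A log transform is one option for the right-skewed solids column.
cleaned["Solids_log"] = np.log1p(raw["Solids"])
print(cleaned.describe().loc[["min", "max"]])
```

Winsorizing before scaling keeps a few extreme readings from compressing the rest of the column into a narrow band near zero.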

Uniqueness & Impact

  • Identified four key features strongly correlated with potability: pH, hardness, sulfate, and chloramines.
  • Found that only ~40% of the samples qualified as potable, highlighting a significant risk area.
  • Provided clear, data-driven insights that could guide water treatment interventions and further predictive modeling.
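
One way to arrive at figures like these is to compute the potable share, the per-class feature means, and each feature's correlation with the target. The sketch below assumes a binary `Potability` column (1 = potable) and uses a hypothetical helper, `potability_summary`, on synthetic data; the numbers it prints are not the project's results.

```python
import numpy as np
import pandas as pd


def potability_summary(df, target="Potability"):
    """Return the potable share, per-class feature means, and target correlations."""
    # With a 0/1 target, the mean is the fraction of potable samples.
    share = df[target].mean()
    # Average of every feature within each class (0 = non-potable, 1 = potable).
    group_means = df.groupby(target).mean()
    # Pearson correlation of each feature with the binary target,
    # ordered by absolute strength.
    corr = df.drop(columns=target).corrwith(df[target])
    corr = corr.sort_values(key=lambda s: s.abs(), ascending=False)
    return share, group_means, corr


# Synthetic stand-in data: "ph" is built to depend on the label,
# "Hardness" is independent noise.
rng = np.random.default_rng(1)
label = np.array([1] * 40 + [0] * 60)
demo = pd.DataFrame({
    "ph": 7.0 + 1.0 * label + rng.normal(0.0, 0.5, label.size),
    "Hardness": rng.normal(200.0, 30.0, label.size),
    "Potability": label,
})

share, group_means, corr = potability_summary(demo)
print(f"potable share: {share:.0%}")
print(group_means)
print(corr)
```

Correlation with a binary target only captures linear separation between the two classes, so it is best read alongside the per-class distributions rather than on its own.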

Trade-offs

  • Imputation vs. Row Dropping: Retained more rows by imputing missing values, trading a small risk of bias for a larger, more representative dataset.
  • Complex vs. Simple Visuals: Chose clarity (heatmaps, bar plots) over overly complex visuals to make results accessible to broader audiences.
  • Balance Between Statistical Rigor & Usability: Opted for methods that struck a balance between accuracy and interpretability, ensuring findings could be acted upon in real-world scenarios.
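
The imputation versus row-dropping trade-off can be made concrete by comparing how many rows each strategy keeps and how imputation changes a column's spread. This is an illustrative sketch on synthetic data with assumed column names, not a reproduction of the project's numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with missing values in two columns.
rng = np.random.default_rng(2)
n = 200
raw = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "Sulfate": rng.normal(330.0, 40.0, n),
})
raw.loc[0:29, "ph"] = np.nan        # rows 0-29 lack pH
raw.loc[20:49, "Sulfate"] = np.nan  # rows 20-49 lack sulfate

# Strategy 1: drop any row with a missing value.
dropped = raw.dropna()
# Strategy 2: fill each column's gaps with that column's median.
imputed = raw.fillna(raw.median())

print(f"rows kept after dropping: {len(dropped)} of {n}")
print(f"rows kept after imputing: {len(imputed)} of {n}")
# Imputation keeps every row but pulls values toward the center,
# which shrinks the column's standard deviation.
print(f"pH std, observed values only: {raw['ph'].std():.3f}")
print(f"pH std, after imputation:     {imputed['ph'].std():.3f}")
```

The reduced spread after imputation is the bias being traded for the extra rows: the dataset is larger, but the imputed columns look slightly less variable than the measurements really were.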

Screenshots

Mobile view

Dashboard view