Earned a credential demonstrating foundational knowledge of data analytics concepts, methodologies, and applications of data science. Gained a conceptual understanding of data cleaning, refinement, and visualization using IBM Watson Studio, and became familiar with the key tools and programming languages used in the data ecosystem. Developed insight into career opportunities and essential skills for success in the data and analytics domain.
Completed a course exploring modern data science and analytics practices, methodologies, and tools. Learned how data-driven approaches enhance innovation, decision-making, and customer engagement across industries. The course covered topics such as leveraging data for competitive advantage, fundamentals of machine learning and AI, ethical and societal implications of data use, and strategies for addressing cybersecurity challenges in data applications.
Objective: Identify irregular CPU utilization patterns in AWS EC2 instances using robust statistical methods, distinguishing operational noise from genuine system failures.
The primary analysis method used in this project is Outlier Detection via the Modified Z-Score, utilizing Median Absolute Deviation (MAD) as the baseline for dispersion.
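For reference, the Modified Z-Score is commonly written as shown below. The symbols M_i (score of observation i), x_i (a single CPU reading), and x̃ (the sample median) are notation introduced here for illustration and are not taken from the project code.

```latex
\mathrm{MAD} = \operatorname{median}_i \left( \lvert x_i - \tilde{x} \rvert \right),
\qquad
M_i = \frac{0.6745 \, (x_i - \tilde{x})}{\mathrm{MAD}}
```

The constant 0.6745 is approximately the 75th percentile of the standard normal distribution, which is why MAD / 0.6745 behaves like a standard deviation when the data are normal.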
Why MAD? Traditional Z-scores rely on the mean and standard deviation, which are highly sensitive to the very outliers we want to detect. In our dataset, the bimodal distribution (two major data clusters) would cause a standard mean to sit in a "low-density" area, leading to inaccurate flagging.
The Baseline: By using the Median, we establish a center point that represents the typical state of the CPU and barely moves when extreme spikes or drops occur, making the model resilient to them.
The Scaling Factor: The analysis uses a consistency constant of 0.6745 to scale the MAD, making it comparable to a standard deviation for a normal distribution while maintaining its robust properties.
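The sensitivity argument above can be illustrated with a minimal sketch. The readings below are made up for illustration (they are not taken from the EC2 dataset), and the helper name `summarize` is hypothetical.

```python
import numpy as np

# Made-up CPU readings (percent), not taken from the EC2 dataset.
clean = np.array([1.8, 1.9, 1.7, 2.0, 1.8, 1.9, 2.1, 1.6, 1.8, 1.7])
# Same readings with two extreme drops appended.
with_drops = np.append(clean, [0.1, 0.1])

def summarize(x):
    """Return (mean, std, median, scaled MAD) for a 1-D array."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    # Dividing by 0.6745 puts the MAD on the scale of a standard deviation
    # when the data are normally distributed.
    return x.mean(), x.std(), median, mad / 0.6745

mean_c, std_c, med_c, smad_c = summarize(clean)
mean_d, std_d, med_d, smad_d = summarize(with_drops)

print(f"mean:       {mean_c:.3f} -> {mean_d:.3f}")
print(f"std:        {std_c:.3f} -> {std_d:.3f}")
print(f"median:     {med_c:.3f} -> {med_d:.3f}")
print(f"scaled MAD: {smad_c:.3f} -> {smad_d:.3f}")
```

In this sample the two drops pull the mean down and inflate the standard deviation several times over, while the median and the scaled MAD stay where they were. That is the property the Modified Z-Score relies on.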
The detection logic follows a conservative statistical standard:
Metric: Modified Z-Score.
Threshold (M): |Z| ≥ 3.5
Logic: Any data point whose Modified Z-Score has an absolute value of at least 3.5, that is, one that deviates from the median by at least 3.5 times the robustly scaled MAD, is labeled as an anomaly (-1), while all others are treated as normal baseline behavior (1).
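The detection rule above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names `modified_z_scores` and `label_anomalies`, the zero-MAD fallback, and the sample readings are all assumptions.

```python
import numpy as np

def modified_z_scores(values, consistency=0.6745):
    """Return the Modified Z-Score of each value, based on the median and MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # Median Absolute Deviation
    if mad == 0:
        # Assumption: with zero dispersion no point can be scored, so return zeros.
        return np.zeros_like(x)
    return consistency * (x - median) / mad

def label_anomalies(values, threshold=3.5):
    """Label each point -1 (anomaly) or 1 (normal) using |Z| >= threshold."""
    z = modified_z_scores(values)
    return np.where(np.abs(z) >= threshold, -1, 1)

# Made-up CPU readings (percent), not taken from the EC2 dataset.
cpu = [1.8, 1.9, 1.7, 2.0, 1.8, 0.1, 1.9, 2.1, 1.6, 1.8]
labels = label_anomalies(cpu)
print(labels)
```

On this sample only the 0.1% reading is labeled -1. The 2.1% and 1.6% readings sit at the edges of the cluster but stay below the 3.5 threshold.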
Distribution Insights: The Kernel Density Estimate (KDE) plot reveals a primary operational cluster between 1.6% and 2.1% CPU utilization.
Anomaly Identification: The analysis successfully isolates data points falling significantly outside this range. Specifically, the model identifies extreme drops (e.g., near 0.1%) that likely represent "idle" states or system downtime that falls outside standard operational windows.
Model Precision: By using a threshold of 3.5 on the Modified Z-Score, the model reduces the "false alarms" that would occur if a classical Z-score threshold of 2.0 or 3.0 standard deviations were applied to this non-normal distribution.
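One way to locate the dense operational cluster is a Gaussian KDE from `scipy.stats`. The sketch below uses a synthetic stand-in for the CPU series (an assumption, not the NAB data), so the numbers it produces are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the CPU series (assumption, not the NAB data):
# a dense cluster near 1.85% plus a few near-idle readings around 0.1%.
rng = np.random.default_rng(0)
cpu = np.concatenate([rng.normal(1.85, 0.1, 500), rng.normal(0.1, 0.02, 5)])

kde = gaussian_kde(cpu)                        # Gaussian kernel, default bandwidth
grid = np.linspace(cpu.min(), cpu.max(), 500)  # evaluation points across the range
density = kde(grid)                            # estimated density at each grid point

peak = grid[np.argmax(density)]                # location of the densest region
print(f"density peaks near {peak:.2f}% CPU")
```

Plotting `grid` against `density` with Matplotlib gives the KDE curve. The near-idle readings show up as a small, separate bump far from the main peak.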
Language: Python
Libraries: Pandas (data manipulation), NumPy (mathematical operations), SciPy's scipy.stats (robust statistics), Matplotlib (visualization).
Dataset: AWS EC2 CPU utilization (Numenta Anomaly Benchmark).
This project demonstrates an understanding of when standard statistical assumptions (like normality) fail. By implementing a Modified Z-score using Median Absolute Deviation, I created an anomaly detection system that is mathematically resilient to the 'pull' of extreme outliers, providing more reliable alerts for infrastructure monitoring.
EC2 dataset
The dataset 1976-2016-president.csv contains detailed election results for U.S. presidential elections from 1976 to 2016.
From 1976 to 2016, the Democratic Party won the popular vote in the majority of presidential elections: 7 of the 11 held in that period.
Thus, the Democratic Party had a clear majority of popular vote wins over this period.
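A count of popular-vote wins per party could be computed along the following lines. This is a sketch under stated assumptions: the column names (`year`, `state`, `party`, `candidatevotes`) are guesses at the CSV layout, and the tiny frame below uses invented vote counts purely to show the grouping logic.

```python
import pandas as pd

# Tiny made-up frame: column names are assumptions about the CSV layout and
# the vote counts are invented for illustration, not real election results.
rows = [
    (2000, "StateA", "democrat", 600), (2000, "StateA", "republican", 500),
    (2000, "StateB", "democrat", 300), (2000, "StateB", "republican", 350),
    (2004, "StateA", "democrat", 550), (2004, "StateA", "republican", 620),
    (2004, "StateB", "democrat", 280), (2004, "StateB", "republican", 330),
]
df = pd.DataFrame(rows, columns=["year", "state", "party", "candidatevotes"])

# Sum state-level votes into a national total per year and party.
national = df.groupby(["year", "party"])["candidatevotes"].sum().unstack("party")

# The party with the largest national total wins the popular vote that year.
winners = national.idxmax(axis=1)

# Number of popular-vote wins per party across all years in the frame.
wins_per_party = winners.value_counts()
print(wins_per_party)
```

With the real file, the same steps would start from `pd.read_csv("1976-2016-president.csv")`, after checking the actual column names.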
Using student data, the analysis predicts student grades based on their activity on Moodle.