NumPy for Statistical Analysis: Descriptive Stats, Distributions, and More
ML-Libraries (Part 5)
📚Chapter: 1— NumPy
If you want to read more articles about Machine Learning Libraries, don't forget to stay tuned :) click here.
Introduction
NumPy, short for Numerical Python, is a powerful open-source library widely used in the Python ecosystem for numerical and mathematical operations. One of its primary applications is statistical analysis, providing a robust framework for handling arrays, mathematical functions, and statistical computations. This blog will explore the various ways NumPy facilitates statistical analysis, from basic operations to more advanced techniques.
Sections
Descriptive Statistics
Working with Distributions
Hypothesis Testing
Correlation and Regression
Data Transformation
Random Sampling
Measures of central tendency
Measures of dispersion
Histograms
Conclusion
Section 1-Descriptive Statistics
Although NumPy is not a dedicated statistical-analysis library, it does provide several descriptive statistics functions. In the NumPy documentation these are presented as "order statistics", "averages and variances", "correlating", and "histograms", but all of those are just descriptive statistics. Also, keep in mind that pretty much any statistical package you'd find in Python is built on NumPy as its "engine" anyway [2].
NumPy simplifies the calculation of essential descriptive statistics, such as mean, median, variance, and standard deviation. For example:
import numpy as np
# Creating a NumPy array
data = np.array([1, 2, 3, 4, 5])
# Calculate mean and median
mean_value = np.mean(data)
median_value = np.median(data)
# Calculate variance and standard deviation
variance_value = np.var(data)
std_deviation_value = np.std(data)
Section 2-Working with Distributions
NumPy supports various probability distributions, making it easier to model and simulate data. You can generate random samples from distributions like normal, uniform, and binomial:
import numpy as np
# Generating random samples from a normal distribution
random_samples = np.random.normal(loc=0, scale=1, size=1000)
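Since the text also names the uniform and binomial distributions, a minimal sketch of sampling from them (parameter values are illustrative, not from the original article) might look like this:
# Uniform samples on the interval [0, 1)
uniform_samples = np.random.uniform(low=0.0, high=1.0, size=1000)
# Binomial samples: successes out of 10 trials with success probability 0.5
binomial_samples = np.random.binomial(n=10, p=0.5, size=1000)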
Section 3- Hypothesis Testing
Hypothesis testing is a critical component of statistical analysis. NumPy itself does not ship hypothesis tests, but its arrays work seamlessly with SciPy, which provides t-tests, chi-square tests, and more:
# One-sample t-test
from scipy.stats import ttest_1samp
t_statistic, p_value = ttest_1samp(data, popmean=3)
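The chi-square test mentioned above also comes from SciPy; a minimal sketch with illustrative observed and expected frequencies (not from the original article) could look like this:
from scipy.stats import chisquare
import numpy as np
# Illustrative observed and expected frequencies (both must sum to the same total)
observed = np.array([18, 22, 20, 25, 15])
expected = np.array([20, 20, 20, 20, 20])
chi2_statistic, chi2_p_value = chisquare(f_obs=observed, f_exp=expected)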
Section 4- Correlation and Regression
NumPy facilitates the computation of correlation coefficients and regression analysis:
import numpy as np
# Example paired data (illustrative values)
data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([2, 4, 5, 4, 5])
# Calculate correlation coefficient
correlation_coefficient = np.corrcoef(data1, data2)[0, 1]
# Linear regression (degree-1 polynomial fit)
slope, intercept = np.polyfit(data1, data2, 1)
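As a quick follow-up (a sketch assuming the slope and intercept fitted above), the fitted line can be evaluated at the original x-values with np.polyval:
# Evaluate the fitted line at the x-values in data1
predicted = np.polyval([slope, intercept], data1)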
Section 5-Data Transformation
NumPy doesn't offer data-transformation features directly, but we can use its existing array operations to perform them [1].
Data Centering: Centering data involves subtracting the mean from each data point. This is often done to remove the effect of a constant term or to facilitate model convergence.
Standardization: This transforms numerical data so that it has a mean of 0 and a standard deviation of 1. This process makes it easier to compare and analyze data measured on different scales.
Log Transformation: Logarithmic transformation is used to make data more symmetric or to stabilize variance in cases of exponential growth.
# Data Centering
data = np.array([10, 20, 30, 40, 50])
mean = np.mean(data)
centered_data = data - mean
# Standardization
std_dev = np.std(data)
standardized_data = (data - mean) / std_dev
# Log Transformation
log_transformed_data = np.log(data)
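As a quick sanity check on the standardization above, the transformed array should have a mean of approximately 0 and a standard deviation of approximately 1:
print("Mean after standardization:", np.mean(standardized_data))  # close to 0
print("Std after standardization:", np.std(standardized_data))    # close to 1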
Section 6-Random Sampling
Random sampling involves selecting a subset of data points from a larger dataset. NumPy also provides tools for generating random numbers from various probability distributions [1].
Sampling:
Simple Random Sampling: Select a random sample of a specified size from a dataset. When sampling without replacement, each item selected is not returned to the population.
Bootstrap Sampling: Bootstrap sampling involves sampling with replacement to create multiple datasets. This is often used for estimating statistics’ variability.
# Simple Random Sampling Without replacement
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
random_samples = np.random.choice(data, size=5, replace=False)
# Bootstrap Sampling
num_samples = 1000
bootstrap_samples = np.random.choice(data, size=(num_samples, len(data)), replace=True)
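As a brief illustration of estimating variability from the bootstrap_samples array above, one can compute the mean of each resample and summarize the spread of those means; this is a minimal sketch rather than the article's own example:
# Mean of each bootstrap resample (one value per row)
bootstrap_means = bootstrap_samples.mean(axis=1)
# Spread of the resampled means: standard error and a 95% percentile interval
standard_error = bootstrap_means.std()
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])
print("Bootstrap standard error of the mean:", standard_error)
print("95% percentile interval:", (lower, upper))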
Generating random numbers: Here are a few ways to generate random numbers with the desired distribution [1]; a combined sketch follows the list.
Integers: Generate random integers within a specified range using np.random.randint()
Uniform Distribution: Generate random values from a uniform distribution using np.random.uniform()
Normal Distribution: Sample random values from a normal distribution using np.random.normal()
Binomial Distribution: Simulate binomial experiments with np.random.binomial()
Poisson Distribution: Model rare events with the Poisson distribution using np.random.poisson()
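A combined sketch demonstrating each of these generators (parameter values are illustrative) might look like this:
import numpy as np
# Random integers in [0, 10)
integers = np.random.randint(0, 10, size=5)
# Uniform distribution on [0, 1)
uniform_values = np.random.uniform(0, 1, size=5)
# Normal distribution with mean 0 and standard deviation 1
normal_values = np.random.normal(0, 1, size=5)
# Binomial: successes out of 10 trials with success probability 0.5
binomial_values = np.random.binomial(10, 0.5, size=5)
# Poisson with an expected rate of 3 events
poisson_values = np.random.poisson(3, size=5)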
Section 7- Measures of central tendency
Measures of central tendency are indicators of the center or typical value of data distributions. Let’s check the most common ones [2]:
# With a NumPy random Generator: rg = np.random.default_rng()
# poisson1 = rg.poisson(5, 1000)
# poisson2 = rg.poisson(50, 1000)
# Equivalent sampling via SciPy's frozen Poisson distributions:
import scipy.stats as sp
poisson1 = sp.poisson(5).rvs(1000)
poisson2 = sp.poisson(50).rvs(1000)
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True)
fig.suptitle('Sampling from poisson distribution')
ax1.hist(poisson1, bins=10)
ax1.set_title("Expectation of interval: 5")
ax2.hist(poisson2, bins=10)
ax2.set_title("Expectation of interval: 50");
import numpy as np
# Sample data
data = np.array([10, 20, 30, 40, 50, 20, 30, 40, 60])
# Mean
mean = np.mean(data)
print("Mean:", mean)
# Median
median = np.median(data)
print("Median:", median)
# Mode (via bincount; valid for non-negative integer data)
mode = np.argmax(np.bincount(data))
print("Mode:", mode)
Section 8-Measures of dispersion
Measures of dispersion are indicators of the extent to which data distributions are stretched or squeezed. Let’s check the most common ones [2]:
rand_matrix = np.random.rand(5,5)
print(f"Pearson product-moment correlation coefficient:\n{np.corrcoef(poisson1,poisson2)}\n")
print(f"Cross-correlation coefficient:\n{np.correlate(poisson1,poisson2)}\n")
print(f"Covariance matrix coefficients:\n{np.cov(poisson1,poisson2)}\n")
print(f"Pearson product-moment correlation coefficient:\n{np.corrcoef(rand_matrix)}\n")
print(f"Covariance matrix coefficients:\n{np.cov(rand_matrix)}")
Section 9- Histograms
Finally, NumPy also offers some convenient functions to compute histograms [2]:
import numpy as np
import matplotlib.pyplot as plt
# Generate random data
np.random.seed(0) # For reproducibility
data = np.random.normal(loc=0, scale=1, size=1000) # Generate 1000 random numbers from a normal distribution
# Create histogram
plt.hist(data, bins=30, color='blue', edgecolor='black', alpha=0.7)
plt.title('Histogram of Random Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
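Since the text refers to NumPy's own histogram functions, here is a minimal sketch of np.histogram, which returns bin counts and edges without plotting:
# Compute histogram counts and bin edges with NumPy directly
counts, bin_edges = np.histogram(data, bins=30)
print("First five bin counts:", counts[:5])
print("First five bin edges:", bin_edges[:5])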
Conclusion
NumPy is an indispensable tool for statistical analysis in Python, providing a comprehensive set of functions for data manipulation and computation, and integrating smoothly with SciPy for hypothesis testing. Its efficiency in handling large datasets and its compatibility with other scientific libraries make it a go-to choice for researchers, data scientists, and statisticians.
This blog only scratches the surface of NumPy’s capabilities in statistical analysis. Whether you’re conducting basic data exploration or implementing complex statistical models, NumPy empowers you to perform these tasks with ease and efficiency. Explore the documentation and integrate NumPy into your statistical toolkit for a seamless and powerful analysis experience.
Please subscribe to Coursesteach to learn more about Machine Learning libraries
🚀 Elevate Your Data Skills with Coursesteach! 🚀
Ready to dive into Python, Machine Learning, Data Science, Statistics, Linear Algebra, Computer Vision, and Research? Coursesteach has you covered!
🔍 Python, 🤖 ML, 📊 Stats, ➕ Linear Algebra, 👁️🗨️ Computer Vision, 🔬 Research — all in one place!
Don’t Miss Out on This Exclusive Opportunity to Enhance Your Skill Set! Enroll Today 🌟 at
Machine Learning libraries Course
🔍 Explore Tools, Python libraries for ML, Slides, Source Code, Free online Courses and More!
Stay tuned for our upcoming articles, where we will explore specific topics related to Machine Learning libraries in more detail, covering the subject end to end!
Remember, learning is a continuous process. So keep learning, keep creating, and keep sharing with others! 💻✌️
Ready to dive into data science and AI but unsure how to start? I’m here to help! Offering personalized research supervision and long-term mentoring. Let’s chat on Skype: themushtaq48 or email me at mushtaqmsit@gmail.com. Let’s kickstart your journey together!
Contribution: We would love your help in making the Coursesteach community even better! If you want to contribute to a course, or have suggestions for improving any Coursesteach content, feel free to reach out and follow.
Together, let’s make this the best AI learning Community! 🚀
Source
[1] Mastering NumPy: A Data Enthusiast's Essential Companion