// the find

donnemartin/data-science-ipython-notebooks

★ 29,162 · Python · NOASSERTION · updated Mar 2024

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

A collection of ~100 Jupyter notebooks covering the classical data science stack: NumPy, pandas, matplotlib, scikit-learn, TensorFlow 1.x, Theano, Keras, Spark, and AWS tooling. It's a reference dump rather than a course — useful for someone who needs a quick working example of a specific technique without digging through official docs.

The breadth of coverage in one place is genuinely useful — you can jump from a pandas merge question to an SVM example without context-switching repos. The scikit-learn section is well-structured with dedicated notebooks per algorithm (knn, svm, pca, gmm, random forest) rather than one monolithic file. The Kaggle notebooks (Titanic, churn) show end-to-end ML pipelines with real data, not toy examples. The MapReduce/mrjob section with actual S3 log parser code and unit tests is more concrete than most tutorial repos.

The content is frozen in ~2017. TensorFlow 1.x and Theano are both dead: Theano's core development ended in 2017, and TF2 replaced the graph-and-session API that the TF1 notebooks rely on. Anyone running these notebooks against current library versions will hit import or attribute errors almost immediately unless they downgrade to old releases. No requirements.txt or pinned environment file with working versions, so reproducing the environment is a guessing game. The deep learning content covers only fundamentals (basic CNNs, LSTMs) and stops well before transformers, attention, or anything post-2018. The AWS section uses boto2-era patterns that boto3 has since replaced.
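Before trying to run anything, it can help to see which notebooks depend on the stale libraries. The following is a minimal sketch, not part of the repo: it walks a local clone and lists notebooks whose code cells import a flagged module. The module list, function names, and regex are assumptions made for illustration. It reads only the nbformat 4 layout (top-level `cells` with a `source` field), so older-format notebooks would be silently skipped, and it only sees the first module on an `import a, b` line.

```python
import json
import re
from pathlib import Path

# Assumption: module names worth flagging for manual review.
# "tensorflow" still imports fine under TF2; it is listed because the
# TF1-style calls inside those notebooks are what tend to break.
LEGACY_MODULES = {"theano", "boto", "tensorflow"}

# Matches the first top-level module on an "import x" or "from x import y" line.
IMPORT_RE = re.compile(
    r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE
)


def imported_modules(notebook: dict) -> set[str]:
    """Return top-level module names imported in a notebook's code cells."""
    found: set[str] = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat 4 stores source as a string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        found.update(IMPORT_RE.findall(source))
    return found


def legacy_imports(root: str) -> dict[str, set[str]]:
    """Map each notebook path under root to the flagged modules it imports."""
    report: dict[str, set[str]] = {}
    for path in sorted(Path(root).rglob("*.ipynb")):
        try:
            notebook = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid JSON notebooks
        if not isinstance(notebook, dict):
            continue
        hits = imported_modules(notebook) & LEGACY_MODULES
        if hits:
            report[str(path)] = hits
    return report
```

Calling `legacy_imports(".")` from the root of a clone returns a dict of notebook paths to flagged module names, which gives a rough map of where the downgrade work would be concentrated.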
