ML Exploration: Titanic Dataset

In summer 2019 I blogged about how I was taking a couple months to work on Machine Learning.

Since then I’ve mostly focused on software for the Mac and server-side development. My hands-on ML knowledge was getting a bit rusty… Plus, things have evolved a bit: new technologies, new approaches, new concepts… Perfect timing, as the new edition of the Hands-On ML with Scikit-Learn, Keras and TensorFlow book was recently released.

I’ll be re-reading it and redoing all the exercises. Below you’ll find the first major exercise, the Titanic dataset, which I completed yesterday. Today I started on a spam filtering model. Excellent book.

Marc

— 

import pandas as pd
import matplotlib.pyplot as plt

#----- INITIAL SETUP
titanic_train_data = pd.read_csv('titanicData/train.csv')

X_train = titanic_train_data.drop(labels='Survived', axis=1).copy()
y_train = titanic_train_data[['Survived']].copy()

#----- DATA EXPLORATION

X_train.head()

X_train.count()
#We have a total of 891 entries. Some columns have missing values:
# - Cabin is only known for 204 entries.
# - Age is only known for 714 entries.
# - Embarked is known for 889 entries.

X_train.describe()
#Key insights:
# - People are quite young, with the median age at 28 and the mean at 29.
# - Most people were in 2nd or 3rd class.
# - Most people did not travel with siblings or spouses (SibSp). Same for parents or children (Parch).
# - Fare varies significantly and could be an indication of the quality of the room.
X_train['Pclass'].unique()
#3 class types.

X_train['SibSp'].unique()
#From 0 to 8.

X_train['Embarked'].unique()
#S, C, Q or nan.

X_train['Sex'].value_counts()
#More male than female, 577 male vs 314 female.

X_train['Pclass'].value_counts()
#A lot more 3rd than 1st and, funnily enough, more 1st than 2nd.
#Plotting the split between classes
#(labels come from the same value_counts() result so they always match the slices)
pclass_counts = X_train['Pclass'].value_counts()
plt.pie(x=pclass_counts, labels=pclass_counts.index, autopct='%1.0f%%')
plt.legend()
plt.show()

#Plotting where people boarded the Titanic
embarked_counts = X_train['Embarked'].value_counts()
plt.bar(x=embarked_counts.index, height=embarked_counts)
plt.show()

#----- DATA PREPARATION

#Feature engineering, combine Siblings and Spouses together with Children and Parents
#X_train['Siblings'] = X_train['SibSp'] + X_train['Parch']

#Remove data we won't be using
#X_train = X_train.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch'])

#Test that it worked correctly
#X_train.head()
#X_train[X_train['Siblings']>1]
from sklearn.base import BaseEstimator, TransformerMixin

class PrepareData(BaseEstimator, TransformerMixin):
    'Feature engineering, all custom changes are done in this class'
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        #Work on a copy so the caller's DataFrame is left untouched
        X = X.copy()
        print(f'About to transform {len(list(X))} items -> {list(X)}')
        X['Siblings'] = X['SibSp'] + X['Parch']
        print(f'Having {len(list(X))} items -> {list(X)}')
        X = X.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch'])
        print(f'Returning {len(list(X))} items -> {list(X)}')
        return X

#----- PIPELINE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='median')),
        ('std_scaler', StandardScaler())
    ]
)

#Get the headers
X_train_num_cols = ['Age', 'Siblings', 'Fare', 'Pclass']
X_train_cat_cols = ['Sex', 'Embarked']
#Handle numerical and non-numerical values separately
ext_pipeline = ColumnTransformer(
    [
        ('num', num_pipeline, X_train_num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), X_train_cat_cols)
    ]
)

full_pipeline = Pipeline(
    [
        ('custPrep', PrepareData()),
        ('ext_pipe', ext_pipeline)
    ]
)

X_train_prepared = full_pipeline.fit_transform(X_train)

#----- MODEL TRAINING AND PREDICTION USING KNEIGHBORS (TO START WITH ONE)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

neigh_clf = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
score = cross_val_score(neigh_clf, X_train_prepared, y=y_train.values.ravel(), cv=5)
score.mean() #80% is not bad considering roughly 60% died and 40% survived

#Death rate (share of passengers with Survived == 0)
(y_train['Survived'] == 0).mean()

#----- IMPROVE MODEL THROUGH GRID-SEARCH
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'n_neighbors': [3, 15, 30, 40, 50],
        'leaf_size': [15, 20, 30, 35, 45],
        'weights': ['uniform', 'distance']
    }
]

neigh_clf = KNeighborsClassifier()
grid_search = GridSearchCV(neigh_clf, param_grid, cv=3, return_train_score=True)
grid_search.fit(X_train_prepared, y_train.values.ravel())
grid_search.best_params_
#{'leaf_size': 15, 'n_neighbors': 30, 'weights': 'uniform'}
grid_search.best_score_

#----- PREPARING FOR SUBMISSION WITH IMPROVED MODEL
neigh_clf = grid_search.best_estimator_
neigh_clf.fit(X_train_prepared, y_train.values.ravel())

X_test = pd.read_csv('titanicData/test.csv')
#y_test_withId = pd.read_csv('titanicData/gender_submission.csv')
#y_test = y_test_withId.drop(columns=['PassengerId'])

X_test_prepared = full_pipeline.transform(X_test)

from sklearn.metrics import accuracy_score
y_test_pred = neigh_clf.predict(X_test_prepared)
#accuracy_score(y_test, y_test_pred) #Can't use this, as the y_test data is fake. Need to submit to Kaggle to get the real score

#----- USE SVM (TO TRY ANOTHER MODEL)

from sklearn import svm

svm_clf = svm.SVC(kernel='poly')
svm_clf.fit(X_train_prepared, y_train.values.ravel())
y_test_pred = svm_clf.predict(X_test_prepared)
#accuracy_score(y_test, y_test_pred)

#Let's try with a linear kernel
svm_clf = svm.SVC(kernel='linear')
svm_clf.fit(X_train_prepared, y_train.values.ravel())
y_test_pred = svm_clf.predict(X_test_prepared)
#accuracy_score(y_test, y_test_pred)
#We can also find the coefficients, which hint at feature importance
svm_clf.coef_[0]

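#And confusion matrix
#(plot_confusion_matrix was removed in scikit-learn 1.2 in favor of ConfusionMatrixDisplay;
# y_test is not defined above, so this plots against the training data instead)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(svm_clf, X_train_prepared, y_train.values.ravel(),
                                      cmap=plt.cm.Blues)

#And directly calculating the numbers and graphing them in a different way
from sklearn.metrics import confusion_matrix
#conf_mx = confusion_matrix(y_test, y_test_pred)
#plt.matshow(conf_mx, cmap=plt.cm.gray)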

#----- GET READY FOR SUBMISSION TO KAGGLE
y_test_withId = pd.read_csv('titanicData/gender_submission.csv')
y_test_withId['Survived'] = y_test_pred
y_test_withId.to_csv('submission.csv', index=False)
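
A quick sanity check on the 80% cross-validation score: a model that always predicts the most common outcome already scores the majority share, so that is the number to beat. The sketch below is not part of the original notebook. `majority_baseline_accuracy` is a hypothetical helper name, and the label list is a toy stand-in for the roughly 60/40 split mentioned in the comments, not the real data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy stand-in (assumption): 6 passengers died (0) and 4 survived (1)
toy_labels = [0] * 6 + [1] * 4
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline: {baseline:.0%}")
```

With the real labels, `majority_baseline_accuracy(list(y_train['Survived']))` should give the same figure as the death rate computed in the notebook.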

NewsWave 2021.5 for Mac & iOS

I’m happy to report that NewsWave 2021.5 for Mac & iOS has been submitted to the App Store. This is a minor update for both apps, focusing on improving stability and addressing minor edge case bugs.

This includes better handling of posts returning ‘NULL’ as the summary, and edge case handling for certain websites, like ‘Engadget’, that return an HTML-escaped entity instead of a plain apostrophe.
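
To illustrate the kind of clean-up described above, here is a minimal Python sketch. It is not NewsWave’s actual code (the apps are not written in Python), and `clean_summary` is a hypothetical name. It assumes the feed sends either nothing, the literal string ‘NULL’, or text that may contain HTML entities.

```python
import html

def clean_summary(raw):
    """Return a display-ready summary, or an empty string if none is usable."""
    if raw is None:
        return ""
    text = raw.strip()
    # Assumption: some feeds send the literal string 'NULL' instead of omitting the field
    if text.upper() == "NULL":
        return ""
    # Convert entities such as &#39; or &amp; back into plain characters
    return html.unescape(text)

print(repr(clean_summary("NULL")))
print(clean_summary("It&#39;s live"))
```

Treating ‘NULL’ as an empty summary keeps the placeholder text out of the article list.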

If you have any comments or feedback do reach me @MarcMasVi on Twitter or marc.maset@hey.com

Hope you enjoy the update,

Marc

Inspecting Mac pkg installers

Have you ever wondered what an installer package is up to? Why is the developer not just providing a dmg with an app? Well, wonder no more! Behold… ‘Suspicious Package’


Yes, the app name is ‘Suspicious Package’ and it’s awesome. Just drop a package on it and it will tell you exactly what the package is up to:

[Screenshot]

I wish I had known about this app sooner. Also, it’s totally free and you can get it from here. Kudos to the indie developer behind it for the great work.

Marc

NewsWave for Mac 2021.01

NewsWave 2021.01 for Mac is live in the App Store. This update is all about Big Sur and Apple Silicon. If you’re on the latest OS, it will improve the app big time. 

Key changes include:

  • NewsWave is now a Universal Binary for Apple Silicon & Intel.
  • Fully compatible with macOS Big Sur: table navigation, cell selection, icon appearance…
  • Several minor improvements & bug fixes, especially around synchronization.
  • Privacy labels so that users know exactly what information is used and for what. 


I hope you like it! As always, if you have any feedback please do get in touch. 

Marc

2020 in review

One of the things I really enjoy about the Christmas break is how much it helps with putting things in perspective. It may be the copious amounts of food, the change in schedule, the time to think…

Whatever the reason, it really helps with assessing how things have gone and where to go next. In this post I’ll be focusing on the former.

Looking back at the roadmap for 2020 I published last year, there were 3 major milestones I was planning for 2020. Here’s the end of the year summary:

  • NewsWave onboarding redesign: improving the onboarding experience to make it seamless without losing any user features. 
    • In January, the 2020.1 update completely overhauled the onboarding user experience. It effectively made it frictionless through a combination of server-side and app changes. 
  • NewsWave for Mac: releasing a fully featured Mac-native version of NewsWave. 
    • After several months of development, NewsWave for Mac was launched on May 26th. Since then, it has had multiple updates to improve the experience and further refine it based on customer feedback. 
  • Excelling: update its codebase to leverage newer technologies introduced since its launch. 
    • Shortly after the release of the macOS version of NewsWave I decided against rewriting Excelling in SwiftUI & Swift for now. The app is still performing correctly and I do not feel the improvements from a rewrite justify the opportunity costs. 

In addition to the above, I started spending more time on non-Apple technologies such as Python, LAMP systems and ML. This will allow me to create better full-stack, multi-OS applications and services in the future.

I’m quite pleased with the progress in 2020 and I’m really excited about all the 2021 possibilities. I’ll be focusing on that part in an upcoming post.

Side note, if you’re interested in Python, or you’d like to refresh your knowledge I strongly recommend this free course from dabeaz -> https://dabeaz-course.github.io/practical-python/ 

Comments/feedback? Do reach me @MarcMasVi on Twitter or marc.maset@hey.com

Marc

NewsWave 2020.4 for Mac

Today I submitted to the App Store what will likely be the final 2020 update, NewsWave 2020.4 for Mac. This is a minor bug fix update to improve unit testing, address bugs & improve UX. 

Key 2020.4 changes include:

– Fixed a bug that could prevent reading position from syncing correctly. 

– Improved UX in several areas, including when a feed has no posts to show. 

– Improved debugging & unit testing. 

Provided all goes well with App Review, it should become available for download in the next couple of days. 

If you have any comments or feedback do reach me @MarcMasVi on Twitter or marc.maset@hey.com

Hope you enjoy the update. Until next time, 

Marc

NewsWave 2020.3.1 for Mac

Today I submitted NewsWave for Mac 2020.3.1, a minor bug fix update.

I would typically wait a few more weeks to combine more enhancements & fixes, but this update addresses an especially elusive and annoying bug.

If you opened NewsWave for Mac from scratch and immediately moved it to the background (i.e. doing something else while the app fetched new content), the app would -sometimes- not scroll correctly to the latest article you had read.

As with most complex bugs, this seemed to happen at random, making it very ‘fun’ to track down. On top of the conditions above, the bug would only trigger if the user had used another device -i.e. an iPhone- and read newer content. 

In addition to the ‘fun’ bug, this release adds a couple of other minor improvements for users that like the ‘Directly opens web page’ setting. Provided there are no surprises with App Review, it should be available in a day or two.

If you have any comments or feedback do reach me @MarcMasVi on Twitter or marc.maset@hey.com

Hope you enjoy the update. Until next time, 

Marc

NewsWave 2020.3 for Mac

I’m happy to report that NewsWave 2020.3 for Mac has been submitted to the App Store.

This update improves the app based on the feedback received since launch. In addition to bug fixes and UX improvements, I’ve also taken the opportunity to expand the number of unit tests that verify each app change, and I’ve tweaked the App Store name from ‘NewsWave Reader’ to ‘NewsWave – News Reader+’ to improve discoverability.

Provided all goes well with App Review it should become available for download in the next couple of days. 

Key 2020.3 changes include:

-Fixed a bug that could result in the setting “Show images in Feed” being ignored.
-Search text will now be correctly reset if user clicks on its sidebar icon.
-When removing an article from the bookmarks section using the key shortcut, the next article now becomes selected.
-Improved wording on helper messages explaining how to add more devices to the user subscription.
-Fixed bug that would show an incorrect dark-mode background color when a search for feeds returned no results.
-Fixed bug that could trigger a message suggesting to add feeds when the right conditions were not met.
-The app may trigger a one-time rating request if the user has read all articles and has been using the app for quite some time.
-Fixed bug that allowed selection of multiple cells if the spacebar was pressed.

Comments / questions?  You can reach me @MarcMasVi on Twitter or marc.maset@hey.com

Hope you enjoy the update, please let me know if you have any feedback. Until next time, 

Marc

A different approach to email with ‘Hey’

Email is one of the most widespread technologies of all time, and it has enabled so much… At the same time, it was designed a long time ago, when the internet was very different.

When I read about ‘Hey’, Basecamp’s attempt at improving email by addressing many of its current shortcomings, I was intrigued. I spend quite a bit of time on emails after all…


After a few weeks I have to say it’s a very interesting concept. I’ll keep my current setup for now, but I found it compelling enough to subscribe for one year. I will try using the address for all development engagement with my customers, where its features will come in quite handy.

Even if you’re not interested in switching, their approach is well worth a read. There’s also a video from the CEO where he walks through the features. 

Comments / questions?  You can reach me @MarcMasVi on Twitter or marc.maset@hey.com

Marc

Unleashing the server developer in you

Back in 2016, when the idea that would become NewsWave was humming in my head, I was sipping my morning coffee while listening to an episode of Under the Radar. 

In that episode, Marco Arment and David Smith were discussing how they used servers to manage Overcast and Feed Wrangler. I was already considering using servers, but after that my mind was made up.

NewsWave Reader was the first app I developed that includes a server component. Two years after its release, here are some of the learnings and experiences that can help if you’re planning to get into servers too.

Depending on the app/service you want to create, using a server comes with many benefits: easily syncing devices, managing payments, providing a searchable repository of information, offloading data tasks from the device to the server, running ML models, crawling the web… 

Now, let me clarify something: if you do not need a server, do not use a server. Servers are overhead; they add another layer of complexity to take care of. Not only that, but you’ll need to account for privacy (how much information should you store vs. not store), security (are you covering all the bases to avoid being hacked), scalability (how would your service handle exponential growth)… In addition, the more people use your app the higher your costs will be, so API call optimization is key.

But what if you need to use a server? What if your new awesome idea for an app/service requires it? If that’s the case, I have great news: it really is not that hard.

Before starting any discussion on setup I’d suggest you think about your business model. As I just mentioned, servers cost money, and the more users you have the higher the cost. Make sure you have a business model that’s sustainable, which is easier said than done these days. I don’t say that lightly: during the first year of NewsWave I lost money almost every single month, so be sure to learn from my mistakes.
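
A back-of-the-envelope check can make that concrete. The sketch below is only an illustration: `break_even_users` is a hypothetical helper and both figures are made up, not NewsWave’s real numbers. It assumes a flat monthly server bill and a fixed average revenue per paying user.

```python
import math

def break_even_users(monthly_server_cost, monthly_revenue_per_user):
    """Smallest number of paying users whose revenue covers the server bill."""
    if monthly_revenue_per_user <= 0:
        raise ValueError("revenue per user must be positive")
    return math.ceil(monthly_server_cost / monthly_revenue_per_user)

# Hypothetical figures, for illustration only
print(break_even_users(monthly_server_cost=20.0, monthly_revenue_per_user=0.5))
```

In practice the server bill also grows with usage, so it’s worth re-running the numbers for each hosting tier you expect to move through.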

Once you have a solid business plan, what about the setup? As Marco suggested in his episode, if you’re an app developer you’re better off sticking with ‘boring’ server-side technologies: they are reliable, efficient and there’s plenty of documentation on the web. I could not be more thankful for his advice. My server stack uses what’s called LAMP: Linux, Apache, MySQL and PHP/Python. Let’s touch on each quickly:

Linux: I use the most solid and stable option possible: Debian. And when choosing what Debian version to use I went with the latest Long Term Release (at the time Debian 9), which gives me years of security updates before I need to update to the next major release.  

Apache: Old and trusted, it manages all my websites and web services. There’s plenty of documentation online, its strengths and weaknesses are well known, and it’s very reliable. It also works easily with certbot for free HTTPS certificates.

MySQL: Here I was doubting between PostgreSQL and MySQL, both are reliable, scalable and heavily used in the industry. In the end I went with MySQL for the simple reason there was more documentation available in Linode (my host provider, more on that in a minute). 

PHP/Python/Perl: I use a combination of PHP and Python: APIs are all PHP, while internal server tasks and Machine Learning models are coded in Python. Again, there’s plenty of documentation online, and both languages are widely used and not cutting edge.

So let’s say you’ve decided to give it a go, you want to start experimenting, what are the next steps? How do you get started?

First you’ll need a host provider, someone that will host the server in a datacenter. There are many options out there; I’ve been using Linode and am very happy with it. If you’re just getting started, they have what’s called a nano plan for $5/month.

Once you’ve signed up, you can easily create a new linode with the latest Debian 10 LTS and then… 


…just follow these instructions to get your server set up & secured with LAMP. Trust me, you’ll be up and running in no time. 

Once all is secured and installed, you can easily connect to it from your terminal and set up your SSH file editor of choice (I personally use the terminal for MySQL and all server maintenance, and Visual Studio Code for Python & PHP development).

And that’s it, you’re good to go. From here you can start adding websites, training models, creating web services, adding crawlers… the sky is the limit. If you mess up, just drop the server and start fresh. Backups are one click away as well, for when things are more solid and the option to rebuild does not look as enticing 🙂

– – –

Looking back I’m very happy to have gone this route: not only could I create NewsWave the way I wanted, but I’ve also learned a ton. If you can, I’d recommend listening to the Under the Radar episodes: intro to servers and follow up questions

If you don’t need a server don’t get one, but if it will allow you to bring your idea to life, go for it. It looks a lot more scary than it is. 

Comments / questions?  You can reach me @MarcMasVi on Twitter.

Until next time, 

Marc