10.27.2021 Checklists as a person in tech

Interviews, Preprocessing data, Reading Technical Papers in 6 minutes.

If you have followed my tech interview guide, you know that I am a big proponent of toolboxes. In the past, I have talked about data structures and algorithms to have in your “interview toolbox” to prepare you for any LeetCode interview. Today, I want to talk a little more about how you can extend that rule to anything in the industry, with a few examples. As an applied scientist, I work with data on a day-to-day basis. More specifically, it is a lot of location data, since I work for Bing Maps. Using buzzwords like “machine learning” and “transformers” and “embeddings” sometimes makes me feel very fancy, but the truth is, I spend a lot of time cleaning up and understanding my data before I can work on it. However, I have considerably reduced the amount of time this first preprocessing step takes by having a checklist (I am avoiding filling this post up with the word “toolbox”). Let me walk you through my checklist:

1) Import the libraries. The most common libraries I use for any task are numpy, pandas and matplotlib. If you are new to data processing: numpy helps convert data to arrays and matrices, which we want because the most commonly used library for training models, scikit-learn, expects data in that form. Pandas is very helpful for creating data frames (or tables). And finally, visualizing data makes interpretation much easier; matplotlib is the library that helps make those beautiful charts and graphs with your data.
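For reference, the imports under their usual aliases look like this:

```python
# The three libraries mentioned above, under their conventional aliases.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```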

2) Import the data. Of course, the MVP of our task is the data itself. There are very few times that I can’t convert the data I am using into a data frame. I may be biased by the type of work I do and the exposure I have had to projects in the past, but pandas makes everything easier by reading all the data into an easy-to-understand table. I can then decide on the next steps.
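A minimal sketch of this step (the file name is made up for illustration):

```python
# Read a hypothetical CSV of location data into a data frame.
df = pd.read_csv("locations.csv")

# First look: size, column types, and how much data is missing per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.head())
```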

Usually, the next steps involve understanding:

a) Are there any columns with missing data? If yes, are they numerical columns or categorical columns?

b) What are the columns that I will be using as my features?

c) What is my target column?

d) How can I fill in missing data? Is there enough of it that I need to impute values, or can I just get rid of the rows/columns with missing data?

e) Should I encode the categorical columns? More often than not, something as simple as one-hot encoding will solve my problems (I sketch d) and e) right after this list).

f) Does my target variable need any encoding?
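Here is a minimal sketch of d) and e), with made-up column names just for illustration:

```python
from sklearn.impute import SimpleImputer

# Hypothetical column names, purely for illustration.
num_cols = ["latitude", "longitude"]
cat_cols = ["road_type"]

# Numerical columns: impute missing values with the column mean.
num_imputer = SimpleImputer(strategy="mean")
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Categorical columns: fill missing values with the most frequent category,
# then one-hot encode them into indicator columns.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])
df = pd.get_dummies(df, columns=cat_cols)
```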

3) Next, I will set aside my test data, which means I will split all of my data into training and testing data, usually an 80:20 split, sometimes a 90:10 split if I have enough data. “Enough” is still subjective at this stage. The reasons we want to do this split as soon as possible (sketched right after this list) are:

a) you are less likely to forget to keep your test set aside

b) you don’t bias yourself or the model at all

c) the next step is feature scaling, and we always want to make the split before we scale
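A minimal sketch of the split, continuing with the hypothetical data frame from above and a made-up target column:

```python
from sklearn.model_selection import train_test_split

# "price" is a made-up target column, purely for illustration.
X = df.drop(columns=["price"])
y = df["price"]

# 80:20 split; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```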

4) One of the confusions I always had when I started out was whether to split before or after feature scaling, and the answer, I now know, is before. We want the test set to be absolutely untouched, so that it reflects the unseen data the model will eventually be evaluated on. And so we split, fit the scaling on the training set, and then use that same fitted scaler to scale the test set.

The reason we want to scale the features is so that no one feature dominates the model. Imagine one column with values between 500 and 600 and another with values between 1 and 10; it is possible that the model interprets the first feature as more important than the second one. To avoid this, we can scale all the features to the same range. The most common methods are standardization and normalization. I usually prefer standardization since it works well on all data, whereas normalization works best on data from a normal distribution.
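A minimal sketch of standardization done in the right order, using the split from the previous step:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then reuse the same fitted mean and standard deviation on the test
# data, so the test set never influences the scaling parameters.
X_test_scaled = scaler.transform(X_test)
```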

And that’s it! My data is ready for me to visualize, play around with models, etc.

Another toolbox I use outside of interviews, and on an almost daily basis, is how I read technical papers. This is also how I read anything, maybe even a news article or a blog post. There are many reasons I read a technical paper: to understand a new field, to find a solution to a technical problem I am trying to solve, to stay current in my field of work. But there are tens of thousands of papers available to us, and it is impossible to read all of them. So I have my checklist, and here is how I go about it:

1) Use Google Scholar / bing.com (notice how I didn’t say google.com: too many sponsored results and not much content, while Bing does a decent job of returning specific papers) to search on specific keywords.

2) Download a few papers and have a “first pass” at them. In this first pass, I don’t spend more than 15 minutes per paper; it is only to decide whether I actually want to read the paper and to judge whether it is relevant to my current needs. The first pass involves reading the abstract and the introduction.

3) If not, I move on to the next paper. If yes, I will spend around an hour or so on a “second pass”. During this iteration, I just want a summary of the paper good enough to explain to a person sitting next to me what it contains. That involves looking at the metric calculations, figures, graphs, and the conclusion section. At this time I also judge the “quality” of the paper. If the labels look incorrect or it looks like the authors have done a shoddy job on the calculations, it is usually a signal to quit reading the paper.

4) Now, if a paper makes it past the second pass, I will usually spend anywhere between 1-5 hours on a “third pass”, to understand it well enough to reproduce its results. Sometimes I don’t understand the paper even after these 5 hours; this is usually when the paper is from a field I am not familiar with, but sometimes also when I am very new to the topic I am trying to research. I have now read enough papers not to get discouraged by not “getting” a paper. I move on and hope there will be another one that will “click”.

I have another checklist for finding and reading review papers (and I feel very lucky when I find a good review paper), but I am now getting really tired of typing, and I think I should give myself a break before I burn out on my first day of writing posts.

About burnout, another time.
