6 Things I Learned in My First 6 Months as a Data Scientist

Every student knows this story: what shows up in the exam is totally different from what was taught in class.

In this article, I’ll try to pen down the gap between learning from online courses and working on professional projects.

Me in 15 seconds: I have been working as a Data Scientist for around half a year now, and it has been quite a learning curve, as I come from a Patent Analytics background (2 years). There, patent trend estimation projects piqued my interest, and I started learning more about the Data Science field alongside my job. I would spend my after-hours going through online courses and my weekends participating in hackathons.

However, when I joined as a Data Scientist after 8 months of self-learning, I was at a loss putting my knowledge into practice. There is a gap between course work and actual project work in organizations: some skills are not taught in courses and can only be mastered once you step into the field.

So I’ve jotted down the 6 major things I learned in the past 6 months.


1. Data Collection and Data Preparation

In a typical analytics problem found online or in hackathons, the data is already available and our task is just to prepare it (cleaning, missing value imputation, and more). A business, however, collects and stores data in multiple systems (Oracle, SAS, MongoDB). As a data scientist, you will have to identify the relevant variables across these data sources. So if you are starting a project from scratch, this process will consume a big chunk of your time.

When engaging in a project, spend plenty of time on data preparation and on getting familiar with the data. It is tempting to aim for a better model rather than improve the data it is built on. However, there is a limit to how far a model can be improved on the available data and variables. Time and effort spent on improving the model's input data will pay off in the long run.

Take home tip: Spend plenty of time getting familiar with the data.
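
To make the tip concrete, here is a minimal sketch of what "getting familiar with the data" can look like when the variables live in two different systems. Everything in it is an assumption for illustration: the two CSV extracts, the column names, and the helper functions `read_rows`, `left_join` and `missing_share` are made up, and a real project would more likely pull from the databases directly.

```python
import csv
import io

# Hypothetical extracts from two separate systems (columns and values are made up).
CRM_CSV = """customer_id,age,city
1,34,Pune
2,,Mumbai
3,51,
"""
BILLING_CSV = """customer_id,monthly_spend
1,120.5
3,80.0
4,42.0
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def left_join(left, right, key):
    """Attach the right-hand columns to each left-hand row (None when no match)."""
    lookup = {row[key]: row for row in right}
    right_cols = [col for col in right[0] if col != key]
    joined = []
    for row in left:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined

def missing_share(rows):
    """Fraction of empty or missing values per column."""
    return {
        col: sum(1 for row in rows if row[col] in (None, "")) / len(rows)
        for col in rows[0]
    }

rows = left_join(read_rows(CRM_CSV), read_rows(BILLING_CSV), "customer_id")
report = missing_share(rows)
for column, share in report.items():
    print(f"{column}: {share:.0%} missing")
```

Even on a toy extract like this, the report shows which variables need attention before any modeling starts.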

2. Production Model (Model Deployment)

Generally, a machine learning framework has 7 steps:
1. Data Collection
2. Data Preparation
3. Model Selection
4. Model Training
5. Model Evaluation
6. Parameter Tuning
7. Prediction

During development, we do Exploratory Data Analysis (EDA), draw graphs, test hypotheses, and try different models. However, once the code goes into the production pipeline, everything should be automated. No manual involvement should be required anywhere in a production pipeline run.
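
As a rough sketch of what "no manual involvement" means, the steps can be chained in a single entry point that either returns a model or fails loudly. The function names (`collect`, `prepare`, `train`, `evaluate`, `run_pipeline`), the toy data, and the `max_error` threshold are all assumptions for illustration, and the model is a simple least-squares line rather than anything from a real project.

```python
import statistics

def collect():
    """Stand-in for pulling data from a source system; the values are made up."""
    return [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, 7.8)]

def prepare(rows):
    """Drop rows with missing values so later steps need no manual cleanup."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def evaluate(model, rows):
    """Mean absolute error (on the training rows, to keep the sketch short)."""
    slope, intercept = model
    return statistics.fmean([abs(slope * x + intercept - y) for x, y in rows])

def run_pipeline(max_error=0.5):
    """Run every step in order; an automated check replaces eyeballing a plot."""
    rows = prepare(collect())
    model = train(rows)
    error = evaluate(model, rows)
    if error > max_error:
        raise RuntimeError(f"Model error {error:.3f} exceeds limit {max_error}")
    return model, error

model, error = run_pipeline()
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} error={error:.2f}")
```

The point of the design is that a scheduler can call `run_pipeline()` on its own: the quality check that a person would do by looking at a graph is written down as a threshold in code.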

3. Engineer’s nightmare: Production code failed

I learned this one the hard way, when my production code failed. Sometimes we miss testing our code for some of the use cases. In such scenarios, error handling comes to the rescue, so that your production code does not stop abruptly when something fails.

Try-catch = Life-saver.

In addition to error handling, logging is also necessary. Logs are used to debug the failure point when the code fails.
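
Here is a minimal sketch of the two ideas together, using Python's `try`/`except` (its spelling of try-catch) and the standard `logging` module. The `score` function, the record fields, and the logger name `scoring_job` are hypothetical; the pattern is what matters: one bad record gets logged and skipped instead of stopping the whole run.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("scoring_job")  # hypothetical job name

def score(record):
    """Hypothetical scoring step that fails on malformed input."""
    return float(record["income"]) / float(record["household_size"])

def score_batch(records):
    """Score every record; log and skip the ones that fail."""
    results, failures = [], []
    for index, record in enumerate(records):
        try:
            results.append(score(record))
        except (KeyError, ValueError, ZeroDivisionError):
            # logger.exception records the message plus the full traceback.
            logger.exception("Scoring failed for record %d: %r", index, record)
            failures.append(index)
    logger.info("Scored %d records, %d failures", len(results), len(failures))
    return results, failures

records = [
    {"income": "50000", "household_size": "2"},
    {"income": "n/a", "household_size": "3"},    # not a number
    {"income": "42000", "household_size": "0"},  # division by zero
    {"income": "30000"},                         # missing field
]
results, failures = score_batch(records)
```

Catching only the exceptions you expect keeps genuinely unknown failures visible, and the logged traceback points at the exact record and line that failed.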

4. Communication Skills

Skillful communication has two aspects: one is gathering correct and feasible technical requirements from the client, and the other is conveying to the client what you have done.

  • Back-and-forth communication with the client is very important. Help them understand what models can offer and what their limitations are. Sometimes there is a mismatch between the client’s expectations and what models can offer technically.
  • Technical expertise is a must, but data presentation and visualization skills make a big impact. Clients and management will generally be non-technical. In addition to the numbers, use creative visualization to sell your model. Even if you did state-of-the-art work, all of it will be in vain if you cannot convey its impact to the client. Harsh truth!

5. Evaluation Criteria

While solving online analytics problems, we don’t pay much heed to the evaluation metric, even though it is what conveys the actual performance of the model.

Take home tip: Choose the evaluation metric wisely, so that it reflects what actually matters for the problem.

For example, suppose we have the following dataset:
Total patients: 100
Diabetics: 5
Healthy: 95
Here, even if our model only predicts the majority class, i.e. that all 100 people are healthy, we get a classification accuracy of 95%, although the model fails to detect a single diabetic patient. Refer to this article for choosing evaluation criteria in case of a classification problem.
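
The numbers above can be checked with a few lines of Python. This is only a sketch: the helper `confusion_counts` is made up for illustration, and recall is used here as one example of a metric that exposes the problem, not as the only right choice.

```python
# 1 = diabetic, 0 = healthy: 5 diabetics and 95 healthy patients.
actual = [1] * 5 + [0] * 95
# A "model" that always predicts the majority class (healthy).
predicted = [0] * 100

def confusion_counts(actual, predicted):
    """Return true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)  # share of diabetic patients the model catches

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.95
print(f"recall:   {recall:.2f}")    # recall:   0.00
```

Accuracy looks excellent while recall is zero, which is exactly the gap a poorly chosen metric hides.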

6. Business Impact

As a data scientist, along with your primary focus on model building, you should also pay attention to the business impact your models are making. Plus, it’s pretty fun to know the stats.

Bonus Point. 20% Rule: Learn new things

While a Data Science job will keep you occupied throughout the day, try to spare some time (around 10–20%) to keep yourself updated on new developments related to your core work. The remaining 80% of your time will be spent on core projects anyway.


Conclusion

In this article, I discussed 6 main points:

1. Data Collection and Data Preparation
2. Production Model (Model Deployment)
3. Error Handling
4. Communication Skills
5. Evaluation Criteria
6. Business Impact

If you have any comments or questions, feel free to leave your feedback below. Follow me on Medium. You can always reach me on LinkedIn.