How did I choose ‘Data Science’ ?

This is an account as to how I found Data Science to be a suitable answer to many problems. It is necessary to provide a little background to have the connections make sense.

I entered slavery fueled by a great passion for the subject of computational fluid dynamics (CFD), and it will always remain close to my heart. As a simplified overview – CFD starts with constructing a ‘strategically’ simplified geometry (CAD model) of the flow path of a fluid. Boundary conditions (inlet/outlet/wall etc.) are defined on these CAD models, which are then ‘discretised’ (i.e. converted to a computational domain/mesh) to acceptable quality metrics. Meshing is ‘an art form’ in itself, as it significantly dictates the time required to solve the model as well as the accuracy of the results. The conservation laws of physics (and other equations) are solved (by the solver) in each discretised cell, resulting in a simulation of fluid flow. In a nutshell – the simulation essentially generates ‘data’.
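To make the ‘solve a conservation law in each discretised cell’ idea a bit more concrete, here is a toy sketch in Python. It is nowhere near a real CFD solver: it assumes a 1D periodic mesh, a constant velocity and a first-order upwind scheme, and the names (`advect_1d`, `mesh_cells` etc.) are made up for illustration only.

```python
def advect_1d(values, velocity, dx, dt, steps):
    """Explicit first-order upwind finite-volume update on a periodic 1D mesh.

    Each entry of `values` is the cell-averaged quantity in one mesh cell.
    Assumes velocity >= 0, so the upwind neighbour of cell i is cell i-1.
    """
    cells = list(values)
    n = len(cells)
    courant = velocity * dt / dx
    if not 0.0 <= courant <= 1.0:
        raise ValueError("need 0 <= velocity*dt/dx <= 1 for this scheme")
    for _ in range(steps):
        # What leaves cell i-1 through the shared face enters cell i,
        # so the total over all cells is conserved (index -1 wraps around).
        cells = [cells[i] - courant * (cells[i] - cells[i - 1]) for i in range(n)]
    return cells


mesh_cells = 50                      # hypothetical mesh size
dx = 1.0 / mesh_cells
initial = [1.0 if 10 <= i < 20 else 0.0 for i in range(mesh_cells)]
result = advect_1d(initial, velocity=1.0, dx=dx, dt=0.5 * dx, steps=40)

print(f"total before: {sum(initial) * dx:.6f}")
print(f"total after:  {sum(result) * dx:.6f}")
```

The two printed totals should agree to within rounding, which is the conservation property in miniature; the ‘data’ here is just the list of cell values, whereas a real simulation produces the same kind of thing over millions of cells and several variables.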

The resultant simulation data (usually several GB in size) had to be ‘post processed’ to extract visualizations of the flow. The simulations had to be run on HPC clusters, and even so usually took hours or even days (depending on the simulation type/mesh size/HPC config). However, it was typically very structured data, with specialized tools and relatively fixed variables to visualize the results (in data science you write code to do viz, though there is commercial s/w, e.g. Tableau). A critical part of data science, as well as CFD, is the visualization of results.

‘Data’ by itself is the first connection. I loved CFD because fluid flow itself is quite complex – and the overall process involves skills in several areas to do well. The latter aspect is analogous to data science, and data science outshines CFD in most areas by comparison. There’s data cleaning, visualization, applying ML, and automation of the process including cross validating different ML models for performance metrics. Beyond this – there is a healthy connection to products, business and ROI as a default (unlike CFD).
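As a rough illustration of what ‘cross validating different ML models for performance metrics’ means, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the synthetic data, the two toy ‘models’ (a mean baseline and a straight-line fit) and the helper names. In practice one would reach for a library rather than hand-roll this.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline 'model': always predict the training mean."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cross_validate(fit, xs, ys, k=5):
    """Average mean-squared error over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))      # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)
        fold_errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test_idx))
    return mean(fold_errors)


# Synthetic data (an assumption for the example): a noisy straight line.
rng = random.Random(0)
xs = [i * 0.25 for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

scores = {
    "mean baseline": cross_validate(fit_mean, xs, ys),
    "straight line": cross_validate(fit_line, xs, ys),
}
for name, mse in scores.items():
    print(f"{name}: cross-validated MSE = {mse:.3f}")
```

The point is only the shape of the workflow: fit each candidate on part of the data, score it on the held-out part, and compare the averaged metric across models.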

CFD is predominantly done with commercial closed source software (e.g. ANSYS/COMSOL). Though OpenFOAM (OF) (open source) has evolved quite a bit over the years, to put it simply – there are some ‘limitations’ (for lack of a better word) in it (both in code and approach) that make it less sensible to adopt at the industry level. All of these packages can (and should) be extended with scripts and User Defined Functions (UDFs), using Python, Perl etc., to implement your own solver or other automation. This is the ‘programming’ aspect of CFD. (Academically (or in very ‘rich’ companies like GE) – you’d look at customizing the algos of the solvers/turbulence models themselves – complex but cool shit idk too much abt anymore). Apparently, typical mech engg jackasses (even CFD d00ds) don’t code much, whereas I enjoyed hacking together scripts to automate… so I was able to save oodles of time and effort and turn around team image/productivity levels in many ways. One connection here is that I was always interested in coding. Also – Python / R etc. based data science toolboxes are typically open source, even in the industry. This means being able to actually learn at home in a meaningful manner, rather than being constrained by the nature of CFD already mentioned. Only massive data sets or very involved ML algos would need relatively better computers, but there were workarounds.

As complex as they are, simulations are not real life. They are approximations, often only as good as the d00d performing them. Only in some companies could the CFD Engg look at the product as well and improve it hands-on via simulations. Even so, there was heavy ‘compartmentalization’ – as in, at the end of his useful life – the CFD Engineer is unlikely to know much abt the product/customer/business. It’s like saying I designed/improved the core of this product – but have no real idea why/how it will be used and how the customer chooses this product. This was re-confirmed by investigating profiles on LinkedIn, job descriptions and also reaching out to experienced contacts. Further on – the utter stupidity of the industry is that if I perform CFD on ‘pumps’ – and want to shift to the simulation of ‘combustion’ (engines, turbines etc.) – I am deemed to have ‘no experience’ !! despite having all that I need to do said work. I struggled with this for about 2 years. To practice CFD (on other topics) at home – I needed computing clusters. This in fact was what drove my first foray into Linux – to build my own. However, learning to use OF, at least back then, was quite tough in itself, and practically not useful for any job application either.

Combustion is incredibly fascinating. Multi-phase physics, plus chemical kinetics, plus moving geometry still sounds like a wet dream for me, and is among the most complex physics problems to simulate. Anyway – theoretical combustion is also a lot about data analysis – from experiments /testing, because there are considerable limitations as to what can be simulated accurately (especially considering finite resources and money.)

I wanted to touch and feel the product AND simulate. It wasn’t possible because of the way the industry compartmentalized/functioned – though the progressive orcs accepted such exposure was needed for a CFD Engineer to be really effective, and it is possible in a (single) handful of companies (in the west). I ditched CFD, i.e. switched to a combustion equipment company in a role that can be translated as ‘technical sales’. That is, I got to use my ‘CFD knowledge’ very occasionally to show off among heathens in specific situations calling for it, while being in combustion and getting direct business exposure. I did not actually perform simulations – but just explained the analysis of others. The main job involved analyzing tonnes of techno-commercial specs, as well as equipment performance data to troubleshoot on occasion, sales data to formulate quotes and support negotiations etc. This is another connection to ‘data analysis’ as a part of many other things. I do not consider ‘data analysis’ to be special in any way, even now. It seems a normal thing to do within a lot of fields, not worth being called a field in itself.

After about 6 months of fruitless applications – the thought went something like this : OK I’m good at ‘analysis’ >> what ‘topic’ has an abundant number of jobs, where I have a good chance to project myself as a ‘suitable candidate’ and also actually crack it based on existing background (and minimum effort)? Preferably one where I am not cuckolded into a specific industry/product/ etc? Or at least one wherein I can actually learn the deficit in a meaningful manner without needing a computing farm?

‘Business analytics’ / Comp Sci / IT / Fintech / Banks / Cloud shit >> at least in Toronto these areas thrive.

OK > Business Analysis ? yucky, not technical enough. What else, programmer/dev? No real skills to break in, also sounds too back-end. Web Dev? Same, and less interesting than dev anyway.

‘Cloud Engg’? No real skills, need to work on certifications – possible but still felt too…. virtual and non-essential. Possibly flaky in terms of real world applications particularly w.r.t security. Clinching factor: no. of job postings relatively lower, though expertise was deemed attainable.

Data Analyst? WTF – it’s just data and analysis. Big deal – everybody does that everywhere. Job sounds boring – but still – not bad as a start (compared to no job anyway). This is where I started to make some connections to CFD.

But then > ‘Data Scientist’? included analytics + coding + machine learning + customer & business involvement + data ‘engineering’ + computing aspects related to calculation / automation purposes. Finally something interesting. Data Science was incidentally hot as a topic i.e translating to hiring interest. Toronto itself has a lot of activity and some well known people in AI research. Salary ranges were incidentally quite nice! Job posts were much much higher in number including all kinds of companies!

No more castration by ‘specialization’ or by domain! I cd realistically switch sectors – I cd learn new concepts and prove expertise in another ‘domain’, say shift from Marketing to Fintech, through demo projects researched on a local, simple computer. Even if said projects were ‘theoretical’ – it was at least not (almost) impossible like CFD! ML is even being used in conjunction with good old CFD in some cases like valve / engine design (still rare though)! See https://blog.insightdatascience.com/using-reinforcement-learning-to-design-a-better-rocket-engine-4dfd1770497a.

Is it all just data? The truth is, like CFD – ‘data science’ is a loose term for a collected bunch of tools/methodologies. You need domain knowledge to be truly effective in understanding the nature of data and extracting sense / value. However, I did have practical exposure. I had a ‘sense’ of what mattered to a customer and could figure out what mattered to a business/product. This was an ‘idea’, which was incidentally reinforced (several times) by the quality of my answers versus those of ‘established working data scientists’ in course community forums, on questions related to general sales / manufacturing / inventory and project management and applying analytics.

The tedious part of a data scientist/analyst job is in cleaning the data to make it suitable for ML algos and further analysis. As a rough distinction – the data analyst does not usually use ML. A data scientist uses ML. A Data Engineer looks at back-end implementation.

From what I’ve seen so far – with ‘some’ intelligence – it is possible to construct workflows that make daily tasks much easier as you move along. That definitely applies to data science (particularly – snippets of reusable code), and this is also how I saved time+effort at past salt-mines. Ultimately .. saving time+effort for me to do what I want (now TMSR)! If one could especially leverage the strengths of different languages seamlessly – like shell scripts, Python, and R – in a single analysis using literate programming approaches – it could save a lot of upfront effort. E.g.: R for data cleaning, Python for ML and shell for stuff in between. Most people don’t know that Emacs Org mode has had multi-language literate programming notebooks for years, and most won’t touch Emacs with a barge pole – because they need a fancy GUI etc. For example, RStudio released ‘multi language’ features in R Markdown only last year or so, and people are still raving about a convenience that I’ve already been using for a while.

So, at the point of ’embrace’, around Mar 2018, I was already familiar with Linux, basic CLI (for some years already), Git (fluency is via Magit in Emacs), literate programming, bits of Python (quite flaky), and other bits and pieces of ‘general computing’, besides the fact I’ve been analyzing data since day 1.

As of today – I reasonably know SQL (recent), much larger bits of R (reasonably comfortable), some Docker, marginally better Python. For the ERP project – I’ve written over 400 lines of R, cleaning up an earlier mess (not efficient code, and sort of one-time use in this project, but some useful functions are constructed for re-use).

Incidentally – I started learning Emacs only for using Org mode, years before I was properly aware of the term ‘data science’. I could therefore manage/create analysis projects, code snippets, documentation, export to well formatted reports etc. – with ease. These are things that Currently Employed data scientists struggle with, and they are needed rather frequently.

I believed I already had several skills in place – now I just had to learn stats, a bunch of ML algorithms, put together some portfolio projects and ideally also have an official project related to ‘data’ – (in a sense the common folk understand) (> ERP!). Particularly : where I’ve failed is the portfolio of projects. I’d also say ‘wasting’ time going deeper into the math behind algos…and not taking time to review+connect all the concepts learnt. More on failure analysis later.

3 responses on “How did I choose ‘Data Science’ ?”

  1. “The tedious part of a data scientist/analyst job is in cleaning the data to make it suitable for ML algos and further analysis.” – from my experience though this is *also* the… largest part of the work really. Yes, the sexy bit (especially for beginners) is always the ML but the actual work consists in 95% data cleaning and pre-processing followed by – if you are lucky – the 5% ML.

    Reading your list and taking your evaluations there at face value (ie I assume them to be correct and base my conclusion on them) though, I’d say “Data Analyst” seems the more logical step since by the sound of it you already have everything you need for it so could start it tomorrow and even – if you really still find it’s worth it – use it as a spring board towards “Data Scientist”.

    So: why not Data Analyst and start applying already? Fwiw, self-evaluation by comparison of responses on forums doesn’t translate at all re employment. It can be useful perhaps only in that it gives you more confidence (and that in turn can help) but for all you know, it’s precisely that the employers are looking for those giving what you consider “dumber” responses – you don’t actually have any data regarding what *they* measure when making their decisions.

    1. Yes, Data Analyst is the logical first step and entry. The differences are really sort of ambiguous in terms of job descriptions. The evaluation is in fact gathered from various readings in terms of a sensible parting line. I will check if I’ve recorded any notes on it. The overlap I’ve seen is usually between Data Analyst and Scientist, since Data Engineering is sort of distinguished by being back end, and is, I believe, a relatively new ‘distinction’ of an aspect that was always an important part of the deployment.

      TBH : I have not really evaluated Data Analyst, until now, as the allure of doing ML was in fact blinding the knowledge I already had (re: cleaning being most of the process and above point), and also because the projected salaries of Data Analysts are generally lower for whatever reasons. However, this is in fact the first thing I have done as ‘revised strategy’ a few days ago : so it has started. I am in the process of creating a composite of the analyst profile descriptions to attack the areas of skill deficit and also initiate applications.

      Re: dumber responses. LOL. More like almost non-sense with no alignment to practical or even logical considerations, which I refined / expanded (significantly) ! Yes. Sadly, it appears there is very little info on what they measure, and feedback on rejection is often non-existent despite several attempts to extract it in the past. That being said – this is certainly an attack area covering my entire application and quite important. The plan is to work on a new version (or level) of ‘dumbed-down’ resume in line with the composite, and I will post both for review soon.
