Data Den: Part 2 Thought Leadership with Michael Krause of Beyond Limits
4 March 2021
Los Angeles, CA
Data Den is a thought-leadership alcove within the world of Beyond Limits where we provide an opportunity to dive into the minds of our gifted data scientists to get a better understanding of their domain. Keep reading to catch a glimpse of their essential expertise; without it, artificial intelligence wouldn’t be possible.
Can you give us more insight into the nature of big data sets?
Data comes in all shapes and sizes, but the challenges of leveraging that data are common and widespread. On the collection side, even quantitative data has some uncertainty: collection devices are not perfect, baselines drift, external stimuli can affect a reading, etc. On the storage side, standards may be non-existent or change over time, and data may be decentralized, incomplete, or stored in inconsistent formats. I recommend establishing data standards as the first step in all engagements – existing data can then be properly standardized and future data collection will be readily ingestible. Establishing quantitative standards is straightforward and typically requires agreement on naming, units, data coverage, etc. Qualitative standards are more challenging but may include examples to establish a common baseline. For example, a quality rating system may provide photographs or user stories to help a user determine an appropriate value.
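A minimal sketch of what such a data standard might look like in practice. The field names, the mapping table, and the unit-conversion rule here are all hypothetical, chosen only to illustrate mapping heterogeneous source fields onto one agreed convention:

```python
# Hypothetical standard: all temperature fields become "temperature_c",
# regardless of how each source system named or recorded them.
FIELD_MAP = {
    "temp_f": "temperature_c",          # source A reports Fahrenheit
    "Temperature (C)": "temperature_c", # source B already uses Celsius
}

def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9

def standardize(record: dict) -> dict:
    """Rename fields per the agreed standard and normalize units."""
    out = {}
    for key, value in record.items():
        std_key = FIELD_MAP.get(key, key)
        if key == "temp_f":
            value = fahrenheit_to_celsius(value)
        out[std_key] = round(value, 2)
    return out

# Records from two different sources now land in one consistent shape.
print(standardize({"temp_f": 212.0}))          # {'temperature_c': 100.0}
print(standardize({"Temperature (C)": 25.0}))  # {'temperature_c': 25.0}
```

Once a mapping like this is agreed upon, historical data can be migrated through it once, and new collection pipelines can emit the standard form directly.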
In many cases, industrial AI data sets aren’t extremely large, often in the gigabyte to tens of gigabytes size range, which may be large by PC standards but not by so-called “big data” standards. When you’re working with data sets like that, it really comes down to good programming practice: it is straightforward but still requires conscious effort to do it right. Examples include making use of efficient algorithms, designing efficient workflows, storing data in sparse matrices, and vectorizing operations. In some cases, it’s necessary to test different libraries to find the most efficient implementation of an algorithm. In other cases, you may even need to go back to fundamentals and write your own algorithm.
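Two of the practices mentioned above – sparse storage and vectorized operations – can be sketched in a few lines. This is an illustrative example with synthetic data, not a production recipe:

```python
import numpy as np
from scipy import sparse

# Synthetic feature matrix that is ~99% zeros, as one-hot or count
# data often is in practice.
rng = np.random.default_rng(0)
dense = rng.random((1000, 1000))
dense[dense < 0.99] = 0.0

# A sparse format stores only the nonzero entries, cutting memory
# from megabytes to kilobytes for this matrix.
sp = sparse.csr_matrix(dense)
print(f"dense: {dense.nbytes / 1e6:.1f} MB, sparse: {sp.data.nbytes / 1e6:.3f} MB")

# Vectorized operations act on whole arrays at once instead of
# looping in Python; both lines below compute the same column means.
col_means_loop = [sum(dense[i, j] for i in range(1000)) / 1000 for j in range(3)]
col_means_vec = dense.mean(axis=0)[:3]
assert np.allclose(col_means_loop, col_means_vec)
```

The vectorized form is typically orders of magnitude faster than the Python loop because the work happens in compiled NumPy code.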
Writing algorithms from scratch can be tedious and time-consuming but, in many cases, pays large dividends in performance. While data scientists often focus on correctness and completeness, poor performance will lead to a poor user experience and ultimately be a barrier to adoption. On the machine learning side, most algorithms scale well through parallelization – and cloud computing has enabled virtually limitless scalability. With so much raw horsepower available, the sky really is the limit for AI model training and deployment. In most situations, however, the real bottlenecks are budget, time, and the value of the problem to be solved. When developing the prediction pipeline be mindful of the end-goal, leverage data, and develop features at the scale appropriate to generate a meaningful prediction. Generalization is more important than perfection.
Building on this thought – Can you talk a little more about how all of this plays into workflows that contribute to the user experience?
For any user-facing applications, the user experience should be front and center throughout the development process. For data scientists, this primarily means accurate results and fast execution. I generally approach these in this order: first, we run an R&D phase where we analyze data, engineer features, and test different prediction approaches to measure algorithm accuracy and build an accurate prediction process. If we are working on a data-sparse or scientific domain-heavy application, then we also include a phase to determine where and how to best embed knowledge in our hybrid solution. After we have a workflow that provides an accurate solution it is time to go back and build an efficient deployable version that meets our performance goals.
We have already discussed the importance of good code design – in practical applications, good performance also means effectively leveraging resources. Training accurate machine learning algorithms takes time, and most commercial solutions involve a workflow of multiple evaluations, each of which potentially requires real-time training and updating. One of the benefits of cognitive reasoning is that it does not require training; knowledge is directly evaluated based on feature input. However, even with an efficient code design, making tens or hundreds of thousands of evaluations in real-time often means leveraging as many resources as possible to ensure the delay between pressing run and seeing the result feels seamless.
If we consider the overall process, we can often identify additional areas for further improvement. In many cases, scenarios can be grouped for fewer evaluations, data transmission volume can be reduced to minimize latency, dimensionality reduction can simplify models, and workflows can be parallelized for even better performance. Every problem will be different, but it is important to think holistically: algorithms, workflows, and resources should all be optimally leveraged to provide a good user experience. As an example of this approach at work here at Beyond Limits, when it came time to push an R&D project to production, the execution time was roughly one day and required improvement. Every part of the code was profiled, every process was broken down, and every bottleneck was identified. Over the course of a few weeks, we were able to reduce the execution time to roughly one minute without any degradation in output quality.
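The profile-everything step described above can be sketched with Python's built-in profiler. `slow_step` and `fast_step` are hypothetical stand-ins for stages of a pipeline; the point is that the profiler's report, not intuition, identifies where the time actually goes:

```python
import cProfile
import io
import pstats

def slow_step() -> int:
    # Deliberately heavy stage: the bottleneck we want the profiler to find.
    return sum(i * i for i in range(200_000))

def fast_step() -> int:
    return sum(range(1_000))

def pipeline() -> None:
    slow_step()
    fast_step()

# Profile one run of the whole pipeline.
profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Rank functions by cumulative time and print the top entries;
# slow_step should dominate the report.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

In a real optimization effort this loop repeats: profile, fix the top bottleneck, and profile again, since removing one hot spot often reveals the next.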
Any advice for data science students, data scientists just getting into the job market, and/or professionals potentially wanting to change their career path to data science?
The market has evolved a lot in the last ten years, in both good and challenging ways. For one, the days of becoming a data scientist through a six-week immersion course are over. While those days are behind us, what it really means is that people who want to get into the field need to put in the leg work and prove their mettle. Ultimately, that is a good thing for the industry. If it really were that easy – and anyone could transition overnight – it would devalue the work we do, and that’s not a career path anyone should want for themselves. There has also been an explosion of algorithms, libraries, and approaches to solve problems – a data scientist should be familiar with this world but doesn’t need to be an expert in all of it. Like with any science, choosing an area or domain and specializing in it has become the norm among classically trained data scientists. Even more common are people who are classically trained in a different domain but focused on leveraging data science tools in that domain during their education. An example might be a neuroscientist who used image recognition to diagnose anomalies from brain CT scans.
A data scientist should be familiar with the standard algorithms, their strengths, weaknesses, and underlying principles, and be able to explain how they work. AI-based solutions, if designed improperly, can produce (at best) misleading or offensive results, such as mislabeled images in a Google search, or (at worst) dangerous results that seriously affect real people in everyday life. Data scientists wield powerful tools and we have an obligation to use them responsibly.
For people working in areas or companies related to AI who want to get familiar with the domain, online immersion courses are a great way to get to know the landscape. For those looking to make a transition, though, the best place to look is often right where they already are. Many companies are actively building data science teams and look internally for candidates. Proactively learning the field through online and weekend courses will give you a leg up, and the support of an employer – while you increase your skill level and gain experience – will be a critical factor in a successful transition.
At its core, data science really is about exploration. Having that entrepreneurial spirit and curious mind to try new things, to be undaunted by failure and to think outside the box, are the foundations the entire field is built on.
Favorite publications, websites, blogs, conferences you attend, or books you read that you find helpful to your work?
- Kaggle is a data science-oriented website with a lot of great data sets. Professionals post their analyses there, so it’s really good for exploration and building experience.
- There are a ton of data science blogs and websites but one of my favorites is the data science section on Medium which is great for digestible and high-level articles.
- Towards Data Science and GeeksforGeeks are also a couple of favorites with the latter being a great reference page for coding.
Any myths/falsehoods/misunderstandings about data science you want to debunk?
Just to reiterate, the overnight data science career is just not a real thing. There are a lot of people out there with YouTube series on how to become a Data Scientist. Don’t be misled: if it sounds too good to be true, it is. Second, Terminators and AGI are not coming just over the horizon. There are very real limitations to this type of technology, and machine learning is not “intelligent” the way a person is. Algorithms don’t make arbitrary decisions; they execute explicit, unambiguous instructions or learn patterns and then repeat those patterns forever.
Dr. Michael Krause is Senior Manager of AI Solutions at Beyond Limits, a pioneering artificial intelligence engineering company creating advanced software solutions that go beyond conventional AI. Michael specializes in industrial AI with experience from bespoke AI solutions at small businesses to digital transformation at large enterprises. Prior to joining Beyond Limits, Michael was Director of Analytics at Tiandi Energy in Beijing, China, and later at Energective in Houston, Texas. Michael holds a Ph.D. in Energy Resources Engineering from Stanford University.