Data Science 101

February 8, 2024

Ask 10 different people to define data science, and you’ll likely get 10 different answers – and that’s just among developers! Ask the general public and you’ll likely get blank stares. But understanding data science is more important today than it has ever been. We have more data than at any point in human history, the volume continues to grow exponentially, and our ability to solve complex problems depends on using that data in new and meaningful ways.

But what does data science have to do with that? Everything, it turns out.

Data science – the ability to derive insights from data – is the foundation for things like artificial intelligence, which is about a machine or software program’s ability to make decisions; generative AI, which turns those decisions into discrete creations like text or art; machine learning, which uses models and algorithms to analyze data and draw inferences from it; and deep learning, which uses multi-layered neural networks trained on large amounts of data to recognize complex patterns.

Data science begins and ends with quality – quality of data on one end, quality of outputs on the other, and quality assessments throughout. The goal is to do data science at scale, which most enterprise or agency-level analysis requires. Any errors produced in these environments are replicated at the same scale, which is why quality matters at every step.

Of course, the most important part of “data science” is “science.” We rely on the underlying assumption that we can stochastically characterize patterns and predictions in large datasets. Accomplishing this requires rigorous processes to ensure we properly define our problem, control for as many contributing factors as is reasonable, and properly design experiments to test and validate our hypotheses and observations. This goes beyond academic research and is directly applicable in business settings. Experimental design can and should be used to test assumptions, process changes, and user behavior in a controlled setting. Such testing becomes particularly difficult with models built on or producing unstructured data like text and images. Finch AI thrives on the challenge of defining and measuring model quality as AI output becomes increasingly subjective and varied.
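To make the idea of a controlled experiment concrete, here is a minimal sketch of a permutation test comparing a control group with a treatment group. It is an illustration only, not a description of Finch AI's tooling: the function name `permutation_test`, its parameters, and the 0/1 "conversion" outcomes are all assumptions made for the example.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(control, treatment, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration. Returns the observed difference
    (treatment mean minus control mean) and an estimated p-value.
    """
    if not control or not treatment:
        raise ValueError("both groups need at least one observation")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = _mean(treatment) - _mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)

    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffling the pooled outcomes simulates "no real effect".
        rng.shuffle(pooled)
        diff = _mean(pooled[n_control:]) - _mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one smoothing keeps the estimate from ever being exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Made-up outcomes: 1 = user completed the task, 0 = did not.
    control = [1] * 10 + [0] * 90
    treatment = [1] * 30 + [0] * 70
    diff, p = permutation_test(control, treatment, n_resamples=2000)
    print(f"observed difference: {diff:.3f}, estimated p-value: {p:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the change had no effect. A permutation test is used here because it makes few distributional assumptions, but other tests may suit a given experiment better.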

But once a data science team has built and validated a new model, the development cycle isn’t over. Continuous monitoring is needed to confirm that a model’s performance holds up over time. Anomaly events and level shifts in real-time data feeds can introduce perturbations in model behavior, and it is vital to detect these changes before they impact the user. For example, any financial forecast model trained in January 2020 was radically out of date only a few months later when the Covid-19 pandemic upended the world. Finch AI designs and deploys complex monitoring pipelines for our key services, ensuring that we are adapting in near real-time to structural and semantic changes in our data feeds.
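As a rough illustration of what detecting a level shift can look like, the sketch below compares the mean of a sliding window of recent values against a baseline period. It is a simple heuristic under stated assumptions, not Finch AI's monitoring pipeline: the function name `detect_level_shift`, the window sizes, and the threshold of 3 standard errors are hypothetical choices, and the baseline period is assumed to be stable.

```python
from statistics import mean, stdev


def detect_level_shift(values, baseline_size=30, window_size=10, threshold=3.0):
    """Return the index of the observation at which an alert first fires.

    Hypothetical helper for illustration. The first `baseline_size` values
    define normal behavior. Each later window of `window_size` values is
    flagged when its mean sits more than `threshold` standard errors away
    from the baseline mean. Returns None when no window is flagged.
    """
    baseline = values[:baseline_size]
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has no variation; a z-score is undefined")

    # Standard error of a window mean if the window came from the baseline.
    std_err = base_std / window_size ** 0.5

    for start in range(baseline_size, len(values) - window_size + 1):
        window = values[start:start + window_size]
        z_score = (mean(window) - base_mean) / std_err
        if abs(z_score) > threshold:
            # Report the newest observation in the flagged window.
            return start + window_size - 1
    return None


if __name__ == "__main__":
    # Made-up feed: steady around 10.5, then jumping to around 15.5.
    feed = [10, 11] * 20 + [15, 16] * 10
    print("alert index:", detect_level_shift(feed))
```

In practice the baseline would need to be refreshed once a shift is confirmed, and a production system would likely combine several detectors. This sketch shows only the core comparison.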

Additionally, it’s important to choose the right tool for the job. You don’t need the largest neural network or large language model (LLM) for every job. Right-sizing data tools for data projects is essential – another term for this is model parsimony, or choosing a model that accomplishes what you need it to with minimal predictor variables. If you want to cut something, you sometimes need a hatchet, and you sometimes need a scalpel; the same is true when picking the right tool from your data science toolbox. This is reflected in product design as well. Instead of relying on document searches using free-text keywords or semantic similarity, we perform entity-driven search, using AI and ML models to identify and extract entities and then disambiguate them against our knowledge base. This boosts accuracy dramatically and lets us link to external sources that discuss the same entities. You can be sure your documents refer to the right person or organization, even if there are others with the same name.
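The disambiguation step can be pictured with a deliberately tiny sketch: given a mention and the sentence around it, pick the knowledge base entry whose known context terms overlap most with that sentence. Everything here is hypothetical and for illustration only, including the `Entity` class, the `disambiguate` function, the two made-up "Jordan Lee" entries, and the overlap scoring. Real systems, including the models described above, are far more sophisticated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    context_terms: frozenset  # words that tend to appear near this entity


# Hypothetical knowledge base with two different people sharing one name.
KNOWLEDGE_BASE = [
    Entity("E1", "Jordan Lee", frozenset({"senator", "committee", "legislation"})),
    Entity("E2", "Jordan Lee", frozenset({"striker", "goal", "league"})),
]


def disambiguate(mention, context, knowledge_base=KNOWLEDGE_BASE):
    """Return the id of the best-matching entity, or None if there is no evidence."""
    tokens = {word.strip(".,;:!?").lower() for word in context.split()}
    candidates = [e for e in knowledge_base if e.name.lower() == mention.lower()]
    if not candidates:
        return None

    # Score each candidate by how many of its context terms appear nearby.
    # On a tie, max() keeps the first candidate in knowledge base order.
    best = max(candidates, key=lambda e: len(e.context_terms & tokens))
    if not best.context_terms & tokens:
        return None  # the name matched, but the context supports no candidate
    return best.entity_id


if __name__ == "__main__":
    sentence = "Jordan Lee scored a late goal to keep the league title race alive."
    print(disambiguate("Jordan Lee", sentence))
```

Returning None when no context term matches is a design choice in this sketch: declining to link is usually safer than linking a document to the wrong person.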

At Finch AI, we are a data-centric organization. Our AI products and our processes depend on data that is continually validated and, therefore, can be consistently trusted. Most of our products start with the modeling lifecycle. We monitor emerging research and develop ideas for high-impact solutions for our customers. We then conduct a feasibility study to determine what is practically achievable, build and test a beta model, and then productionize the model as a service in one or more of our products.

Quality assurance (QA) is an essential function that runs concurrently with our entire development process. Our QA engineers curate our training and evaluation sets, validate model outputs, ensure proper unit and regression tests are in place, evaluate product integration, and perform detailed error analysis on user feedback. To ensure trustworthy AI products, quality has to be a key stakeholder at all stages in the DevOps process. In production, this includes quality assessments and model monitoring, as well as continuous upgrading of our data assets such as knowledge bases, whitelists, blacklists and more. As a core service, it includes operationalizing models and making data science a practice as well as a commitment that enables rapid development and deployment.

Data science also underpins our ability to use data from and in multiple formats. It allows us to help customers make sense of structured and unstructured data, streaming and static, real-time and archival. It helps us improve accuracy and completeness by giving our teams the ability to merge data from multiple sources. Looking ahead, data science and our transparent, data-centric focus are enabling us to explore parametric information extraction and multimodal generative AI.

Data science is a rapidly evolving and complex discipline. We at Finch AI deliver impact by designing to high standards of excellence in our data, models, and quality processes. By embedding our models in our comprehensive software platforms, we remove the guesswork and automate intelligence gathering in a way that is trustworthy and seamless.

Learn more about how we apply sound data science throughout our organization and our offerings by visiting www.finchai.com.
