Data Science Pipeline "First Mile and Last Mile problems are like the dark matter of Data Science. They get 20% of the attention but are responsible for 80% of the outcome." John Tukey, considered by many to be the father of Data Science, famously said the following about the importance of solving the "right question": "Far better an approximate answer to the right question, which is often vague, than…
15 LLM Challenges 1. Data Privacy Naturally, one of the biggest concerns for users of an LLM such as ChatGPT is data privacy. Some common questions users have are: Is submitted data used to train and improve the model? How long is user data stored on servers? How are the companies behind these chatbots complying with GDPR / CCPA / HIPAA laws? How much "Personal Identifiable…
In this post, we will look at the key results from a paper published by a group of Google researchers titled "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI". So why is this a big deal and why should you pay attention? Let's look at some examples of real-world data cascades quoted directly from the paper: "Eye disease detection models, trained on noise…
Gartner's 4-Stage Data Maturity Model This is a simplistic model that is focused solely on an organization's . There is a lot to unpack from this simple graph: and are about the . and are about the . Value lies in predicting the future based on past data but this is also error-prone and hard. Let's take a finance example where one is analyzing the…
Background Streamlit is a very popular open-source framework that pitches itself as a pure Python framework to build and share Data Web Apps in minutes with no front-end experience needed. Snowflake, a popular cloud computing company, acquired Streamlit in March 2022 for $800 million. A closer look at this acquisition gives us some insights about the merits of the framework, the future…
Five Key Ideas About Large Language Models 1. Biomimicry Biomimicry is the practice of imitating life. It involves looking to nature for inspiration and direction to solve complex human problems. So why does this work? Well, if you think about it, nature has been constantly evolving ever since life first appeared on Earth some 3.8 billion years ago. Can there be a better and proven source of…
Polya Problem Solving Framework When presented with any problem, it is very natural to go head-on into problem-solving mode. However, this is not always the optimal strategy. Here's why: You may be solving a problem that has already been solved efficiently. You may not be aware of the second-order effects and side effects of your solution. You may be solving the wrong problem. Your…
So why exactly is JSON so popular? JSON (JavaScript Object Notation) has several advantages as seen below. JSON Advantages JSON originated from JavaScript object literals as defined by the ECMAScript Programming Language Standard. The ECMAScript standard facilitated interoperability of web pages across different web browsers. Consequently, JSON quickly became the de-facto data interchange format…
The International Data Corporation (IDC) forecasts 175 zettabytes of data will be created by 2025. One zettabyte is equal to one sextillion bytes, that is, 10^21 (1,000,000,000,000,000,000,000) bytes, which is a trillion gigabytes. 175 zettabytes of data by 2025 Can you pause for a second and at least attempt to quantify or visualize this in your mind? As a comparison point, the Library of…
Ever since its November 2022 release, ChatGPT has taken the world by storm. Given the fast rate of change in this space, one might find it hard to keep up with the pace of iterative development. This is especially true for those who are new to this field or are out of touch with the progress in this domain. While there have been many important contributions that have made ChatGPT a reality…
PPDAC Problem-Solving Cycle The key thing to observe in the image above is that the i.e. we iterate through the entire PPDAC cycle as well as within each step of the cycle. The framework was developed by R. J. MacKay and R. W. Oldford. Most recently it was popularized by David Spiegelhalter in his book "The Art of Statistics: Learning from Data". Problem As always the first step is to…
Gestalt principles can be ranked from strongest to weakest influence as follows: Enclosure > Connection > Proximity > Similarity Gestalt Principles Order Similarity The similarity principle uses a common feature such as . In the following generic example, take a moment to closely observe how "Shape" and "Color" are used to create two groups. Gestalt Similarity Gestalt…
This post aims to answer some common questions related to the open-source Parquet file format such as: What are Parquet files? What are the benefits of using the Parquet file format? How does the Parquet file format compare with other popular formats such as CSV and JSON? Should you use Parquet files for Data Science? What are Parquet files? The Parquet file format offers very efficient compression…
Well now that we have your attention, hopefully we can introduce you to the fascinating field of Social Science which might offer some insights about the central question in this post - . Social Science While the field of Social Science is very broad and has been shaped by contributions from a wide array of academic disciplines, in this post we would like to draw attention to the work of two…
Summarized below is a list of ten simple rules published by a group of senior statisticians. Ten Simple Rules For Effective Statistical Practice The nice part about these rules is the emphasis on non-technical issues which are easy to understand even for those with no formal background in statistics. The hard part is working out how to apply these "simple rules" to your use case. Well…