How much data would it take to store every Nutrition Facts label in the United States? Well, let's find out. I'll be using C, since it's the lowest-level language many programmers are comfortable with, and it is plenty efficient for this task.
The data required for one label can be expressed as a C struct with 42 data items, covering everything from vitamins, to brand name, to serving size: all of the data points found on Nutrition Facts labels that cannot reasonably be calculated from other values. A struct is a composite type that groups variables sharing one layout under a single name, so many labels can be stored and managed uniformly.
Using C's sizeof operator, which accounts for compiler-dependent padding, I calculate 224 bytes per nutrition label:
50 letters each for food and brand names (100 bytes)
10 letters for serving size and total units (“container” being the longest) (20 bytes)
14 32-bit floating point numbers for decimal-significant numbers (56 bytes)
2 of those floats store servings counts, which are converted to fractions on the label
24 unsigned 16-bit integers (unsigned shorts) for numbers between 0-65,535 (48 bytes)
You might be wondering where the calories have gone. They will be calculated from the macronutrients: protein and carbohydrates each contribute 4 calories per gram, while fat contributes 9 calories per gram. Percent Daily Values will also be calculated at display time, since the recommended amounts are prone to change over time.
So then, how many labels are needed? Open Food Facts maintains a collaborative database of 347,507 foods in the United States alone, at the time of writing. This should account for about 20 years’ worth of food, seeing as the USDA records about 20,000 new foods per year.
347,507 labels at 224 bytes each equals 77,841,568 bytes, or 74.24 megabytes of data! Is that more or less than you expected?
How and why I created Wen: Chinese Character practice.
On a train ride from Hangzhou to Guilin, China, I began writing a program to help learn Chinese characters. The idea is simple: present a character on the screen, with or without Pinyin, and the user will input the translation. The program presents a score at the end.
This original Python iteration uses dictionary files with comma-separated components: the Chinese character, the Pinyin with tone numbers, and the definition separated by forward slashes (“/”). An example would be:
零, li2ng, 0/zero
The program then splits up each section, replaces the tone numbers with diacritics (i.e. “líng”), and compares the user’s input against each of the definitions. You can look at the code for this original command-line version here.
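The parsing step can be sketched in a few lines of Python. This is a minimal sketch, not the actual Wen code: the function names are mine, and I assume tone numbers 1–4 appear immediately after the vowel they mark, as in the “li2ng” example above.

```python
# Maps each plain vowel to its forms for tones 1-4 (index 0 = no tone).
TONES = {
    "a": "aāáǎà", "e": "eēéěè", "i": "iīíǐì",
    "o": "oōóǒò", "u": "uūúǔù", "ü": "üǖǘǚǜ",
}

def add_diacritics(pinyin: str) -> str:
    """Replace each vowel+tone-number pair (e.g. "i2") with the accented vowel."""
    out = []
    i = 0
    while i < len(pinyin):
        ch = pinyin[i]
        if ch in TONES and i + 1 < len(pinyin) and pinyin[i + 1] in "1234":
            out.append(TONES[ch][int(pinyin[i + 1])])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def parse_entry(line: str):
    """Split one comma-separated dictionary line into its three parts."""
    hanzi, pinyin, defs = (part.strip() for part in line.split(","))
    return hanzi, add_diacritics(pinyin), defs.split("/")

print(parse_entry("零, li2ng, 0/zero"))  # ('零', 'líng', ['0', 'zero'])
```

Comparing the user's answer is then just a membership check against the returned list of definitions.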
However, this is the year 2021, and users, myself included, expect an application like this to have a graphical interface, perhaps even over the web. I would venture to say that some among us would even launch the program on their phones. Ask no more: the online, user-friendly version can be found here, and a live version is hosted here.
Rendered in the beautiful default browser font, this program is lightning-fast and just sips data at 2-3 kilobytes per chapter (excluding the > 500 word cumulative exam, clocking in at a whopping 57 kilobytes).
All jokes aside, I have worked on larger projects, and for longer, but Wen is the one I am most proud of. Chinese is a notoriously hard language to learn, and after studying for more than 5 years, I still have light-years to go. I personally use Wen multiple times per week, and have found it to work better for rote memorization than anything else, including flashcards. I think this is because it demands typed input, which further secures each word in my memory.