Today I learned: Storage expensive, Data priceless

We have a combination Plex Media/Minecraft/Archive server that we’ve had since we purchased our first 6TB Hard Drive on December 30, 2019 ($99.99 at the time). After some time we upgraded to our massive 14TB Hard Drive ($293.00 at the time) on October 16, 2021. It took a bit over a couple of years to fill that up, and we’ve recently invested in a 16TB Hard Drive ($279.00 at purchase) to keep up with our storage needs.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       457G  288G  169G  64% /
/dev/sda1      1014M  202M  813M  20% /boot
/dev/sdd1        13T   12T   52G 100% /mnt/usb14
/dev/sdc1       5.5T  4.3T  962G  82% /mnt/usb03

Now it’s time to get this new drive ready for use.
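The broad strokes are the usual partition, format, and mount routine. Here’s a minimal sketch, assuming the new 16TB drive shows up as /dev/sde and we want a single ext4 partition mounted at /mnt/usb16 (the device name and mount point are placeholders for illustration, not the actual setup):

$ lsblk   # confirm which device the new drive landed on
$ sudo parted --script /dev/sde mklabel gpt mkpart primary ext4 0% 100%
$ sudo mkfs.ext4 -L usb16 /dev/sde1
$ sudo mkdir -p /mnt/usb16
$ sudo mount /dev/sde1 /mnt/usb16
$ df -h /mnt/usb16

An /etc/fstab entry (ideally keyed by the UUID from blkid) makes the mount stick across reboots.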

Today I learned: command > script

Had to compare two files at work today. Actually, I had to compare one file to a series of files to see what data exists in both of them. This effectively comes down to an INNER JOIN: we only want the left file’s lines when they also exist in the right file.

So, written as a PHP script, it comes down to:

<?php
// Usage: php compare.php <file1> <file2>
ini_set('memory_limit', '256M'); // setting names are lowercase and case-sensitive
if (!file_exists($argv[1])) { die('file ' . $argv[1] . ' not found'); }
if (!file_exists($argv[2])) { die('file ' . $argv[2] . ' not found'); }
// Load every non-empty line of the first file into memory
$fp = fopen($argv[1], 'rt');
$lines = [];
do {
  $line = trim(fgets($fp));
  if (strlen($line) > 0) {
    $lines[] = $line;
  }
} while (!feof($fp));
fclose($fp);
// Walk the second file and print any line that also exists in the first
$fp = fopen($argv[2], 'rt');
do {
  $line = trim(fgets($fp));
  if (strlen($line) > 0) {
    if (in_array($line, $lines)) {
      echo "$line\n";
    }
  }
} while (!feof($fp));
fclose($fp);

This script, while it works like a charm, takes a while with large numbers of records: in_array() does a linear scan over $lines for every line of the second file, so the comparison is effectively O(n×m).

After some googling, it turns out this script isn’t really necessary if you use grep correctly. You also gain the speed of a compiled executable in one fell swoop.

$ grep -Fxf [file1] [file2]

Output is exactly the same: -F treats the patterns as fixed strings rather than regexes, -x only counts whole-line matches, and -f reads the pattern list from [file1].
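And since the original task was really one file against a series of files, the same trick scales out with a small shell loop (the file names here are made up for illustration):

$ for f in exports/*.txt; do echo "== $f =="; grep -Fxf master.txt "$f"; done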

Today I learned: regex > loop

While writing “quad-quad”, a set of four 4-letter speakable words that can be used as a user-friendly “bookmark” for easily finding a record, I wrote a “quick” program to extract the contents of wikidatawiki-20220820-pages-articles-multistream.xml (a Wikipedia dump) and ran into a large delay in the following loop:

$alphas = 'qwertyuiopasdfghjklzxcvbnm ';
$newline = '';
for ($x = 0; $x < strlen($line); $x++) {
    $c = substr($line, $x, 1);
    // keep lowercase letters and spaces, replace everything else with a space
    if (strpos($alphas, $c) !== false) {
        $newline = $newline . $c;
    } else {
        $newline = $newline . ' ';
    }
}

The loop’s main purpose is to sanitize any non-letter data by replacing unknown characters with a space for later processing. The end result would be words that I could filter down to 4-character words and tally up.

When the program read a line around 1MB in length it would “hang” for a bit as it chewed through the data: calling substr() and strpos() for every single character, and rebuilding the string one byte at a time, adds up fast. In a nutshell, 25,100,655 bytes of data took 24m36s. It was time to optimize.

Replacing the loop with the following regex increased performance immensely.

$newline = preg_replace('/[^a-z]/', ' ', $line);

The same amount of data took 1.892s.
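From here, the filtering and tallying that the sanitized lines feed into is short as well. A minimal sketch of that downstream step (the variable names and the arsort() at the end are my own, not from the original program):

// split the sanitized line on runs of spaces and keep only the 4-letter words
$tally = []; // in the real program this accumulates across all lines
$words = preg_split('/ +/', $newline, -1, PREG_SPLIT_NO_EMPTY);
foreach ($words as $word) {
    if (strlen($word) === 4) {
        // $tally maps word => number of times seen
        $tally[$word] = ($tally[$word] ?? 0) + 1;
    }
}
arsort($tally); // most common 4-letter words first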

Lesson: If you don’t know regexes, learn regexes.