• Hideakikarate@sh.itjust.works
    11 hours ago

    However, these existing efforts have some major issues:

    Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it. And such files are often poorly seeded.

    Later…

    We primarily used Spotify’s “popularity” metric to prioritize tracks. View the top 10,000 most popular songs in this HTML file (13.8MB gzipped).

    I must be kinda stupid, but this sounds like doublespeak to me: “Only popular music gets preserved, so we preserved music by popularity.”

    • Kaul@lemmy.dbzer0.com
      7 hours ago

      It’d probably be more useful to read the post directly on Anna’s Archive, where they include plenty of graphs and infographics that make the data understandable. Unfortunately this article has none of that. The “over-focus on popular artists” point quite literally means they’re only missing artists who aren’t being listened to, most of whom are probably AI anyway.

      https://annas-archive.li/blog/backing-up-spotify.html

    • Lojcs@piefed.social
      10 hours ago

      To be fair, the 10k is just a sample. The true amount is 86 million, about a quarter of all Spotify songs.

      Put another way, for any random song a person listens to, there is a 99.6% likelihood that it is part of the archive. We expect this number to be higher if you filter to only human-created songs. Do remember though that the error bar on listens for popularity 0 is large.

      For popularity=0, we ordered tracks by a secondary importance metric based on artist followers and album popularity, and fetched in descending order.

      We have stopped here due to the long tail end with diminishing returns (700TB+ additional storage for minor benefit), as well as the bad quality of songs with popularity=0 (many AI generated, hard to filter).

      Also, it sounds like they had difficulty scraping some of the less popular songs and got them from somewhere else.
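
To make the "small share of tracks, large share of listens" point concrete, here is a minimal sketch of the ordering described in the quotes above: sort by popularity, break ties with a secondary score, archive the top of the list, and compare the share of tracks with the share of listens. Everything specific here is an assumption for illustration: the `Track` fields, the secondary-score formula, and the listen counts are made up, and this is not Anna's Archive's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class Track:
    title: str
    popularity: int        # 0-100 popularity score
    artist_followers: int  # hypothetical input to the secondary metric
    album_popularity: int  # hypothetical input to the secondary metric
    listens: int           # made-up listen count


def priority_key(track):
    # Popularity first; ties (notably popularity 0) are broken by a
    # secondary score. This formula is an assumption, not the real one.
    secondary = track.artist_followers + 1000 * track.album_popularity
    return (track.popularity, secondary)


def coverage(tracks, n_archived):
    # Rank in descending priority and "archive" the first n_archived tracks.
    ranked = sorted(tracks, key=priority_key, reverse=True)
    archived = ranked[:n_archived]
    track_share = len(archived) / len(ranked)
    listen_share = sum(t.listens for t in archived) / sum(t.listens for t in ranked)
    return ranked, track_share, listen_share


tracks = [
    Track("obscure B", 0, 40, 0, 40),
    Track("hit", 95, 5_000_000, 90, 9_000),
    Track("obscure A", 0, 1_200, 3, 60),
    Track("album cut", 40, 200_000, 55, 900),
]

ranked, track_share, listen_share = coverage(tracks, n_archived=1)
print([t.title for t in ranked])
print(f"{track_share:.0%} of tracks, {listen_share:.1%} of listens")
# prints: ['hit', 'album cut', 'obscure A', 'obscure B']
# prints: 25% of tracks, 90.0% of listens
```

With a skewed listen distribution, archiving a quarter of the toy catalogue already covers most listens, which is the same shape of argument as the 86 million and 99.6% figures quoted above.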