Content classification and search

astromd@beehaw.org to Free and Open Source Software@beehaw.org – 8 points –

My small, non-profit team produces a lot of content in the form of blogs, presentations, graphics, mp3 and mp4 files. We are looking for a tool that can classify the content and let us search it to find relevant information on specific topics. The goal is to get the most out of the existing IP we've developed. Are any of you using any #foss tools to do this? Bonus points if it supports natural language querying or generative AI.

I suppose you can split your content into three categories:

  • text
  • audio
  • image

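For text, you can use LangChain, which lets you generate embeddings from text (read more here: https://js.langchain.com/docs/modules/data_connection/text_embedding/). Once each document has an embedding, you can search by embedding the query the same way and ranking documents by similarity.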
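
To make the embedding idea concrete, here is a rough sketch of similarity search over text. It does not use LangChain: the `embed` function is a toy word-count stand-in for a real embedding model, and the file names and texts are made up for illustration. With a real model you would swap out `embed` and keep the ranking step roughly the same.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: dict, top_k: int = 3) -> list:
    """Rank documents by similarity to the query, dropping zero-score matches."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), name) for name, text in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_k] if score > 0]


# Hypothetical content: in practice these would be blog posts, slide text,
# and audio transcripts loaded from disk.
docs = {
    "blog-fundraising.md": "Tips for fundraising campaigns and donor outreach",
    "podcast-ep3-transcript.txt": "Transcript: volunteers discuss grant writing and fundraising",
    "slides-onboarding.txt": "Onboarding presentation for new volunteers",
}

results = search("fundraising grant", docs)
for name, score in results:
    print(name, score)
```

A word-count vector only matches exact words, so it will miss synonyms; a learned embedding model is what gets you closer to natural language querying.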

For images, you can use CLIP (this model is open source, from OpenAI). You can read more about it here: https://github.com/openai/CLIP

For audio, I don't know of anything off the top of my head, but you are likely to find something open source along the same lines as the tools above.

Thanks for the suggestions. I have audio transcripts of all the mp3s.

An internal wiki like DokuWiki or Wiki.js might suit your needs. Although they won't automatically categorize or classify anything, a wiki could be a useful searchable repository (especially if you can train your team to standardize descriptions, tags, categories, etc.).

Interesting suggestion. I’ll see if there are any existing workflows along these lines.