Add sitemap.xml #524
Comments
Can you please fill out the feature request template and edit your issue accordingly, including the information from the original proposal? As for the request itself, I think we need useful privacy options before we talk about an XML file that just contains pointers to everything on an instance. Additionally, I am skeptical whether this is a good thing in the first place. Either way, I think that comments should not be present in the sitemap at all (that's not in the proposal, I just wanted to say it).
I updated the formatting. I'm curious why you are skeptical about using an open standard for defining content location, priority, and how frequently the content is updated. The lack of a sitemap.xml does not determine whether the content is indexed or not. If you search https://www.google.com/search?q=kreynen+drupal and scroll down through the results, you will eventually find Kbin, Reddit and Mastodon posts. If it's public, Google will index it. This feature would give instance owners the option of influencing how often Google indexes specific content from the instance.
I think my hesitation comes from not really knowing a lot about it, and from the worry that it makes it a lot easier for everybody to find things they are not supposed to find. So I don't have a good reason for blocking it, because security by obscurity is not security...
As I'm sure you are aware, it's not a great idea to rely on obscurity for security. You can't even rely on bots to respect a robots.txt. If something is available without authentication to HTTP requests, assume it will eventually show up in a Google search. Google has a special relationship with large projects. If you scan a Drupal or WordPress site with https://pagespeed.web.dev/, you will get Drupal or WordPress specific suggestions to improve the page performance... which reduces Google's cost to index the content. Adding a Sitemap.xml has come up in Mastodon too. mastodon/mastodon#11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API. I'm going to share more about why we want this feature in Matrix.
my 2c braindump on this:
My rationale here is that you could set up an instance where only a handful of magazines get a sitemap index, or an instance that is meant to be seen and indexed but excludes some magazines from indexing, like those for meta discussion/reports about the instance itself or a general lobby magazine. That applies only if the instance wants this to be exposed at all; otherwise nothing should be generated if the instance doesn't want to be easily seen/indexed by search engines and other bots/tools. Also, ideally the code should be authored by the contributor (i.e. you, if you want to submit patches), but depending on how sitemap generation is done and how expensive it could be, perhaps enlisting help from an external bundle might be a decent choice (note that I'm mostly going off in any case, feel free to make a fork and do some experiments in the meantime, and maybe open a PR/propose the patch if/when you feel like you've got something?
I agree with
Additionally I would add
I think we should definitely leverage the scheduler component for this. We haven't yet built the framework to just use it, but I wanted to mention it anyway.
at first I also thought of this mode for magazine, but I decided on configurable defaults for easy allowlist/blocklist mode of operation when sitemap generation is active
that'd be good too, but I didn't mention this since adding option for (local) is easy, but I have no idea on how to best enforce these for remote users posting to the local magazine
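To make the allowlist/blocklist idea above concrete, here is a minimal sketch in Python (not Mbin's PHP, and not existing Mbin code). The function name, the `enabled` flag, and the `mode`/`listed` settings are all hypothetical names chosen for illustration; the only point is that nothing is produced when sitemap generation is off, and that the same list can act as either an allowlist or a blocklist.

```python
def magazines_to_index(magazines, enabled, mode, listed):
    """Pick which local magazines get a sitemap (hypothetical settings).

    enabled: instance-wide switch; when False nothing is generated at all.
    mode:    "allowlist" (only listed magazines) or "blocklist" (all but listed).
    listed:  magazine names the instance admin configured.
    """
    if not enabled:
        return []
    listed = set(listed)
    if mode == "allowlist":
        return [m for m in magazines if m in listed]
    if mode == "blocklist":
        return [m for m in magazines if m not in listed]
    raise ValueError(f"unknown mode: {mode!r}")


local = ["drupal", "meta", "lobby"]
# Index everything except meta discussion and the lobby magazine.
public = magazines_to_index(local, enabled=True, mode="blocklist", listed=["meta", "lobby"])
```

With the sample values, `public` contains only `drupal`; switching `enabled` to `False` yields an empty list regardless of mode.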
My 2c are:
Is your feature request related to a problem? Please describe.
When searching for something like https://www.google.com/search?q=drupal+reservation+systems, users will often find links to Reddit ranked relatively high in the results.
Google isn't using https://www.reddit.com/sitemap.xml to find new Reddit posts. Google is treating Reddit differently than the rest of the semantic web... and will continue to do that with deals like https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/.
For a new community/Mbin instance to compete with an existing Reddit community, it has to be discoverable outside of ActivityPub clients.
Describe the solution you'd like
Adding a sitemap.xml that lists the magazines and collections on an instance is one way to improve how quickly Google and other search engines find and index content. My recommendation is to provide this as an option magazines can opt into. The root level sitemap.xml of the instance would be a sitemap index of the local magazines that choose to generate a sitemap.xml.
The Magazine level sitemap.xml would include the details of threads posted.
Ignoring the fact that https://kbin.social/m/drupal is hosted on kbin.social for the moment... if https://kbin.social/m/drupal was the only magazine that opted in, the root level sitemap.xml file at https://kbin.social/sitemap.xml would look like...
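As a rough sketch of that root level file, the Python below builds a sitemap index following the sitemaps.org 0.9 schema, with one `<sitemap>` entry per opted-in magazine. The function name and the list of opted-in magazines are assumptions for illustration; only the URL layout (`/m/<magazine>/sitemap.xml`) comes from the proposal.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, opted_in_magazines):
    """Return a sitemap index with one entry per opted-in magazine."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in opted_in_magazines:
        entry = ET.SubElement(root, "sitemap")
        # Each entry points at the magazine level sitemap.xml.
        ET.SubElement(entry, "loc").text = f"{base_url}/m/{name}/sitemap.xml"
    return ET.tostring(root, encoding="unicode")


# Only the drupal magazine has opted in.
index_xml = build_sitemap_index("https://kbin.social", ["drupal"])
print(index_xml)
```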
The magazine level sitemap.xml at https://kbin.social/m/drupal/sitemap.xml would look like...
The priority for each magazine could be calculated using pinned and votes. Changefreq would be based on replies and voting in that thread.
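The magazine level file and the priority/changefreq idea could be sketched along these lines. This is Python for illustration, assuming the values are set per thread URL. The scoring rules (pinned threads get 1.0, others scale with their share of votes; recent replies or votes mean a more frequent changefreq), the field names, and the thread URLs are all hypothetical, not something Mbin defines.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def thread_priority(pinned, votes, max_votes):
    """Hypothetical heuristic: pinned -> 1.0, otherwise 0.1..0.9 by vote share."""
    if pinned:
        return 1.0
    if max_votes <= 0:
        return 0.5  # no votes anywhere in the magazine: use the schema default
    share = min(max(votes, 0) / max_votes, 1.0)
    return round(0.1 + 0.8 * share, 1)


def thread_changefreq(activity_last_day, activity_last_week):
    """Hypothetical heuristic: activity = replies + votes in the given window."""
    if activity_last_day > 0:
        return "hourly"
    if activity_last_week > 0:
        return "daily"
    return "monthly"


def build_magazine_sitemap(threads):
    """Return a <urlset> document with one <url> per thread."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    max_votes = max((t["votes"] for t in threads), default=0)
    for t in threads:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = t["url"]
        ET.SubElement(url, "changefreq").text = thread_changefreq(
            t["activity_last_day"], t["activity_last_week"]
        )
        priority = thread_priority(t["pinned"], t["votes"], max_votes)
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(root, encoding="unicode")


# Made-up threads; the URLs are placeholders, not real kbin.social threads.
threads = [
    {"url": "https://kbin.social/m/drupal/t/1", "pinned": True, "votes": 3,
     "activity_last_day": 2, "activity_last_week": 9},
    {"url": "https://kbin.social/m/drupal/t/2", "pinned": False, "votes": 10,
     "activity_last_day": 0, "activity_last_week": 0},
]
magazine_xml = build_magazine_sitemap(threads)
```

Search engines treat `priority` and `changefreq` as hints at most, so the exact formula matters less than keeping the values within the schema's allowed ranges.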
Describe alternatives you've considered
My interest in this request is for a very specific use case, but when I started looking into this I found someone else had already opened the feature request in https://codeberg.org/Kbin/kbin-core/issues/1305. I started looking into some of the options for generating sitemap.xml files with modern PHP/Symfony, but never got a response from the KBin community on which direction would align with the project's architecture... so now I'm asking the same questions here.
https://keeplearning.dev/generate-sitemap-in-symfony-6-6068c37225 gives a good, high-level overview of bundle vs. custom controller approaches. I know nothing about these bundles or the Mbin project's preferred approach to a feature like this, but I'm willing to volunteer a few cycles to move this forward if someone more familiar with the project is willing to point me in the right direction.
While I think I could get all the information I need to generate the sitemap.xml from instances that have the API enabled like https://kbin.melroy.org/api/magazines?p=1&perPage=48&sort=hot&federation=local&hide_adult=hide and https://kbin.melroy.org/api/magazine/25/entries?sort=hot&time=%E2%88%9E&p=1&perPage=25&usePreferredLangs=false and generate the files with a service outside the MBin codebase, that's a really inefficient way to generate those files on a low traffic instance.
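For reference, an external generator would have to page through those two endpoints. The sketch below only builds the request URLs, using the query parameters visible in the links above; the helper names are made up, and the response format is deliberately not assumed here.

```python
from urllib.parse import urlencode


def magazines_url(base_url, page, per_page=48):
    """URL for one page of local magazines (parameters copied from the link above)."""
    query = urlencode({
        "p": page,
        "perPage": per_page,
        "sort": "hot",
        "federation": "local",
        "hide_adult": "hide",
    })
    return f"{base_url}/api/magazines?{query}"


def entries_url(base_url, magazine_id, page, per_page=25):
    """URL for one page of entries in a magazine."""
    query = urlencode({
        "sort": "hot",
        "time": "∞",  # urlencode percent-encodes this as %E2%88%9E
        "p": page,
        "perPage": per_page,
        "usePreferredLangs": "false",
    })
    return f"{base_url}/api/magazine/{magazine_id}/entries?{query}"


first_magazines = magazines_url("https://kbin.melroy.org", page=1)
first_entries = entries_url("https://kbin.melroy.org", magazine_id=25, page=1)
```

Every magazine needs at least one request per page of entries, which is the inefficiency mentioned above compared with generating the files inside the codebase.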
Additional context
If someone points me in the right direction, I'm happy to take a stab at this.