Add sitemap.xml #524

Open
kreynen opened this issue Feb 23, 2024 · 8 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), needs feedback (Requires a greater consensus to make an informed decision)

Comments

@kreynen

kreynen commented Feb 23, 2024

Is your feature request related to a problem? Please describe.

When searching for something like https://www.google.com/search?q=drupal+reservation+systems, users will often find links to Reddit ranked relatively high in the results.

Screenshot 2024-02-26 at 8 25 12 AM

Google isn't using https://www.reddit.com/sitemap.xml to find new Reddit posts. Google is treating Reddit differently than the rest of the semantic web... and will continue to do that with deals like https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/.

For a new community/mbin instance to compete with an existing reddit community, it has to be discoverable outside of ActivityPub clients.

Describe the solution you'd like

Adding a sitemap.xml that lists the magazines and collections on an instance is one way to improve how quickly Google and other search engines find and index content. My recommendation is to provide this as an option magazines can opt into. The root-level sitemap.xml of the instance would be a sitemap index of the local magazines that choose to generate a sitemap.xml.

The Magazine level sitemap.xml would include the details of threads posted.

Ignoring the fact that https://kbin.social/m/drupal is hosted on kbin.social for the moment... if https://kbin.social/m/drupal was the only magazine that opted in, the root level sitemap.xml file at https://kbin.social/sitemap.xml would look like...

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://kbin.social/m/drupal/sitemap.xml</loc>
  </sitemap>
</sitemapindex>

The magazine level sitemap.xml at https://kbin.social/m/drupal/sitemap.xml would look like...

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://kbin.social/m/drupal/t/814091/Following-Kbin-communities-from-Mastodon-is-as-easy-as-searching</loc>
    <lastmod>2024-02-04</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://kbin.social/m/drupal/t/860608/Ways-to-Optimize-Carousel-Sliders-in-Drupal-for-Faster-Page</loc>
    <lastmod>2024-02-26</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.1</priority>
  </url>
  <url>
    <loc>https://kbin.social/m/drupal/t/855307/The-Essential-Drupal-Commerce-Modules-for-building-Online-Stores</loc>
    <lastmod>2024-02-24</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

The priority for each thread in a magazine could be calculated from whether it is pinned and how many votes it has. Changefreq would be based on replies and voting in that thread.
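
As a rough illustration of that idea, the sketch below maps pinned status and votes to a priority, and days since the last reply or vote to a changefreq. The thresholds and the names `thread_priority` and `thread_changefreq` are assumptions made up for this example; the only fixed parts are that the sitemap protocol expects priority between 0.0 and 1.0 and changefreq values such as `daily`, `weekly`, `monthly` and `yearly`. Python is used for illustration only.

```python
from datetime import date


def thread_priority(pinned: bool, votes: int, max_votes: int) -> float:
    """Hypothetical heuristic: pinned threads get 1.0, others scale with votes.

    max_votes is the highest vote count among threads in the same magazine.
    """
    if pinned:
        return 1.0
    if max_votes <= 0:
        return 0.5  # no voting data yet: fall back to the protocol default
    share = max(0, votes) / max_votes
    # Keep unpinned threads between 0.1 and 0.9 so pinned ones stay on top.
    return round(min(0.9, 0.1 + 0.8 * share), 1)


def thread_changefreq(last_activity: date, today: date) -> str:
    """Hypothetical heuristic: more recent replies/votes mean more frequent change."""
    age_days = (today - last_activity).days
    if age_days <= 1:
        return "daily"
    if age_days <= 7:
        return "weekly"
    if age_days <= 31:
        return "monthly"
    return "yearly"
```

Search engines treat both values as hints at most, so a simple, cheap heuristic is probably enough here.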

Describe alternatives you've considered

My interest in this request is for a very specific use case, but when I started looking into it I found someone else had already opened the feature request in https://codeberg.org/Kbin/kbin-core/issues/1305. I started looking into some of the options for generating sitemap.xml files with modern PHP/Symfony, but never got a response from the KBin community on which direction would align with the project's architecture... so now I'm asking the same questions here.

https://keeplearning.dev/generate-sitemap-in-symfony-6-6068c37225 gives a good, high-level overview of bundle vs. custom controller approaches. I know nothing about these bundles or the Mbin project's preferred approach to a feature like this, but I'm willing to volunteer a few cycles to move this forward if someone more familiar with the project is willing to point me in the right direction.

While I think I could get all the information I need to generate the sitemap.xml from instances that have the API enabled like https://kbin.melroy.org/api/magazines?p=1&perPage=48&sort=hot&federation=local&hide_adult=hide and https://kbin.melroy.org/api/magazine/25/entries?sort=hot&time=%E2%88%9E&p=1&perPage=25&usePreferredLangs=false and generate the files with a service outside the MBin codebase, that's a really inefficient way to generate those files on a low traffic instance.

Additional context

If someone points me in the right direction, I'm happy to take a stab at this.

@kreynen kreynen added the enhancement New feature or request label Feb 23, 2024
@BentiGorlich
Member

Can you please fill out the template for a feature request and edit yours accordingly? And add the information from the original proposal?

As for your request, I think we need to have useful privacy options before we talk about an xml file that just contains pointers to everything from an instance. Additionally, I am skeptical whether this is a good thing in the first place. Either way, I think that comments should not be present in the sitemap at all (not in the proposal, just wanted to say it).

@BentiGorlich BentiGorlich added the more information needed (requires more info to solve) and needs feedback (Requires a greater consensus to make an informed decision) labels and removed the more information needed label Feb 26, 2024
@kreynen
Author

kreynen commented Feb 26, 2024

I updated the formatting. I'm curious about why you are skeptical about using an open standard for defining content location, priority and the frequency that the content is updated? The lack of a sitemap.xml does not determine whether the content is indexed or not.

If you search https://www.google.com/search?q=kreynen+drupal and scroll down into the results, you will eventually find Kbin, Reddit and Mastodon posts. If it's public, Google will index it. This feature would give instance owners the option of influencing how often Google is indexing specific content from the instance.

Screenshot 2024-02-26 at 10 14 39 AM

@BentiGorlich
Member

BentiGorlich commented Feb 28, 2024

I think my hesitation comes from not really knowing a lot about it and making it a lot easier for everybody to find things they are not supposed to find. So I don't have a good reason for blocking it, cause security by obscurity is not security...
Just 2 hints: Lemmy has a sitemap, though not a very extensive one, Mastodon does not

@kreynen
Author

kreynen commented Feb 28, 2024

As I'm sure you are aware, it's not a great idea to rely on obscurity for security. You can't even rely on bots to respect a robots.txt. If something is available without authentication to HTTP requests, assume it will eventually show up in a Google search.

Google has a special relationship with large projects. If you scan a Drupal or WordPress site with https://pagespeed.web.dev/, you will get Drupal or WordPress specific suggestions to improve the page performance... which reduces Google's cost to index the content.

Adding a Sitemap.xml has come up in Mastodon too. mastodon/mastodon#11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API.

I'm going to share more about why we want this feature in Matrix.

@asdfzdfj
Contributor

asdfzdfj commented Feb 29, 2024

my 2c braindump on this:

  • for instance operator, this process MUST be opt in, and should have at least the following options:
    • provide option to allow sitemap generation at all
    • provide option to allow sitemap generation on magazine by default
    • provide option to allow sitemap generation on magazine contents by default
  • for each magazine:
    • provide options to allow sitemap generation for the magazine and its contents that may go against the instance defaults described by the second/third option above
    • remote mags shouldn't be included in sitemap generation
    • random should also be excluded from sitemap generation, despite the instance defaults

my rationale here is that you could set up an instance where only a handful of magazines get a sitemap, or an instance that is meant to be seen and indexed but excludes some magazines from indexing, like those for meta discussion/reports about the instance itself or a general lobby magazine. all of that only applies if the instance wants this exposed at all; otherwise nothing should be generated if the instance doesn't want to be easily seen/indexed by search engines and other bots/tools.
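
A minimal sketch of how the options in the list above could resolve, in Python for illustration only. The class and field names are hypothetical, and two points are assumptions rather than anything decided in this thread: the random magazine is excluded even when a magazine-level override is set, and a magazine's contents are only listed when the magazine itself is listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceSettings:
    """Hypothetical instance-level switches, all off by default (opt in)."""
    sitemap_enabled: bool = False
    magazines_default: bool = False
    contents_default: bool = False


@dataclass
class Magazine:
    """Hypothetical magazine record; None means 'follow the instance default'."""
    name: str
    is_local: bool = True
    sitemap_override: Optional[bool] = None
    contents_override: Optional[bool] = None


def magazine_in_sitemap(instance: InstanceSettings, mag: Magazine) -> bool:
    """Decide whether a magazine appears in the root sitemap index."""
    if not instance.sitemap_enabled:
        return False
    # Remote magazines and "random" are never listed (assumption: even with an override).
    if not mag.is_local or mag.name == "random":
        return False
    if mag.sitemap_override is not None:
        return mag.sitemap_override
    return instance.magazines_default


def contents_in_sitemap(instance: InstanceSettings, mag: Magazine) -> bool:
    """Decide whether a magazine's threads are listed (assumes the magazine is listed)."""
    if not magazine_in_sitemap(instance, mag):
        return False
    if mag.contents_override is not None:
        return mag.contents_override
    return instance.contents_default
```

With `magazines_default=False` the overrides act as an allowlist, and with `magazines_default=True` they act as a blocklist, which covers both setups described above.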

also, ideally the code should be authored by the contributor (i.e. you, if you want to submit patches), but depending on how sitemap generation is done and how expensive it could be perhaps enlisting help from an external bundle might be a decent choice (note that I'm mostly going off presta/sitemap-bundle for now since the other one appears to be archived, but I'm quite interested in dumped sitemaps functionality that could be periodically updated, if live sitemap generation by custom controller could become an expensive operation)

in any case, feel free to make a fork copy and do some experiments in the meantime, and maybe make a PR/propose the patch if/when you feel like you've got something?

@BentiGorlich
Member

I agree with

  • don't include random
  • give magazines options (though I am for opt out in this case, but a configurable global default is good as well)
  • exclude remote content

Additionally I would add

  • include users who opt in
  • exclude the instance sites like the about page, etc., as there are often private addresses from the instance owner present

I think we should definitely leverage the scheduler component for this. We haven't yet built the framework to just use it, but I wanted to include it anyway.

@asdfzdfj
Contributor

  • give magazines options (though I am for opt out in this case, but a configurable global default is good as well)

at first I also thought of this mode for magazine, but I decided on configurable defaults for easy allowlist/blocklist mode of operation when sitemap generation is active

  • include users who opt in

that'd be good too, but I didn't mention it since adding an option for (local) users is easy, while I have no idea how to best enforce it for remote users posting to a local magazine

@melroy89
Member

melroy89 commented Mar 2, 2024

Adding a Sitemap.xml has come up in Mastodon too. mastodon/mastodon#11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API.

My 2c are:

  • All of the above already mentioned by @asdfzdfj & @BentiGorlich

  • Try to add the (generated) sitemap on the root path (/sitemap.xml). And point to other sitemaps from there if needed.

  • Do NOT use our APIs for creating a sitemap.xml. Like you said, it's not the most efficient way. If you want to generate a sitemap, use PHP just like the rest of the project and you can leverage internal methods to retrieve only data you really need. You can also write dedicated queries/DTO to retrieve data from the database.

  • Cache the sitemap.xml internally for a certain period of time, so that if I request the sitemap.xml 10 times in a row it has no impact. Do not regenerate the sitemap from scratch every time. Otherwise this will most likely cause too much load and waste server resources.

  • Limit the max. results in the (sub) sitemap.xml files, e.g. limit the number of records retrieved (a hard DB limit) and/or how far back in time they go (not more than several months/years?). This will improve performance and also make the sitemap more relevant for search engines.

  • We are not fully SEO optimized. Meaning sitemap.xml is a good start (I also generated them for my site, like my blog), but most likely there are other next steps to improve getting indexed by search engines like Google. Just saying, this will be out of scope for now of course.
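
The caching and limiting points above could look roughly like the sketch below. It is Python for illustration only (the real thing would be PHP inside the project, with the limit applied in the database query), and `CachedSitemap`, `select_recent_threads` and the default limits are hypothetical. Separately from any project-specific limit, the sitemap protocol itself caps a single file at 50,000 URLs.

```python
import time
from datetime import date, timedelta
from typing import Callable, List, Optional, Tuple


class CachedSitemap:
    """Serve a generated sitemap from memory until it is older than ttl_seconds."""

    def __init__(self, generate: Callable[[], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._generate = generate
        self._ttl = ttl_seconds
        self._clock = clock
        self._body: Optional[bytes] = None
        self._generated_at = 0.0

    def get(self) -> bytes:
        now = self._clock()
        # Regenerate only on first use or once the cached copy has expired.
        if self._body is None or now - self._generated_at >= self._ttl:
            self._body = self._generate()
            self._generated_at = now
        return self._body


def select_recent_threads(threads: List[Tuple[str, date]], today: date,
                          max_age_days: int = 365,
                          max_entries: int = 1000) -> List[Tuple[str, date]]:
    """Keep only (url, lastmod) pairs that are recent enough, newest first, capped."""
    cutoff = today - timedelta(days=max_age_days)
    recent = [t for t in threads if t[1] >= cutoff]
    recent.sort(key=lambda t: t[1], reverse=True)
    return recent[:max_entries]
```

Ten consecutive calls to `get()` within the TTL trigger a single generation. A dumped sitemap that a scheduled job refreshes, as suggested earlier in the thread, would achieve the same effect without an in-memory cache.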

@melroy89 melroy89 added the good first issue Good for newcomers label Sep 22, 2024