Drone stops working after some time (connection reset on the server) #2430

Closed
r3pek opened this issue Jun 5, 2018 · 1 comment

r3pek commented Jun 5, 2018

OK, I'm opening the bug because I tried every other way to get some help.

I'm using Drone CI to build my Docker images directly from commits to my Gitea instance. Everything works fine if I started Drone and its agents right before the commit happens. If I let Drone keep running for, say, 20 minutes (sometimes even less), it still sees new commits, but the build never starts. I have to restart the whole Drone Docker stack, and then it immediately starts building the pending job.
Nothing shows up in the logs when the new "job" is detected, and the build stays in the pending state. After 4 hours, though, I get this in the server logs:

2018-06-04T23:59:40.632060431Z [GIN-debug] GET    /healthz                  --> github.com/drone/drone/server.Health (12 handlers),
2018-06-05T03:59:40.940891899Z INFO: 2018/06/05 03:59:40 transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.4.5:9000->10.0.4.8:32880: read: connection reset by peer,
2018-06-05T03:59:40.940963739Z INFO: 2018/06/05 03:59:40 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled",
2018-06-05T03:59:41.663779461Z INFO: 2018/06/05 03:59:41 transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.4.5:9000->10.0.4.7:35628: read: connection reset by peer,
2018-06-05T03:59:41.664026034Z INFO: 2018/06/05 03:59:41 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled"

Builds stop running long before that, though.
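To make the "4 hours" figure concrete, here is a minimal sketch that computes the gap between the last /healthz line and the first connection reset in the log excerpt above. It assumes the leading timestamps are RFC 3339 values with nanosecond precision in UTC; `parse_log_timestamp` is a hypothetical helper written for this illustration, not part of Drone or Docker.

```python
from datetime import datetime, timezone


def parse_log_timestamp(stamp: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp that carries nanoseconds.

    Python's datetime only stores microseconds, so the fraction is
    truncated to six digits.
    """
    whole, _, fraction = stamp.rstrip("Z").partition(".")
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


# Timestamps copied from the first two log lines in the report.
last_healthz = parse_log_timestamp("2018-06-04T23:59:40.632060431Z")
first_reset = parse_log_timestamp("2018-06-05T03:59:40.940891899Z")

gap = first_reset - last_healthz
print(f"time between the two log lines: {gap}")
```

The reset lands almost exactly four hours after the previous log line, which suggests a fixed timeout somewhere along the path rather than a random failure, although the log alone does not say which component enforces it.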

This is my docker-compose.yml file:

version: '3.3'

services:
    server:
        image: drone/drone
        depends_on:
            - traefik_dameon
        volumes:
            - type: volume
              source: data
              target: /var/lib/drone/
              volume:
                  nocopy: true
        deploy:
            replicas: 1
            restart_policy:
                condition: on-failure
                delay: 5s
                max_attempts: 3
                window: 120s
            labels:
            - DRONE_OPEN=true
            - DRONE_HOST=https://xxxxxxxxxxxxxx
            - DRONE_SECRET=87710e922a8215c7
            - DRONE_DATABASE_DRIVER=mysql
            - DRONE_DATABASE_DATASOURCE=xxxx:xxxx@tcp(db:3306)/drone?parseTime=true
            - DRONE_GITEA=true
            - DRONE_GITEA_URL=https://xxxxxxxxxxxxx
            - DRONE_ADMIN=xxxxxxxxx
        networks:
            - default
            - traefik_net
            - mariadb_net

    agent:
        image: drone/agent
        depends_on:
            - server
        command: agent
        deploy:
            mode: global
            restart_policy:
                condition: on-failure
                delay: 5s
                max_attempts: 3
                window: 120s
            update_config:
                parallelism: 1
                delay: 20s
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock
        environment:
            - DRONE_SERVER=drone_server:9000
            - DRONE_SECRET=87710e922a8215c7
            - DRONE_DEBUG=true
            - DRONE_PLUGIN_PULL=true
        networks:
            - default

volumes:
    data:
        driver_opts:
            type: "nfs"
            o: "addr=xxxxxxxxxxxxxxx,rw,sec=sys,rw,async,soft,timeo=150,retrans=3"
            device: ":/data/docker/drone"

networks:
    traefik_net:
        external: true
    mariadb_net:
        external: true
    default:
        driver: overlay

This is a 2-node Docker swarm running Docker 18.05.0-ce.

bradrydzewski commented Jun 5, 2018

The "read: connection reset by peer" error occurs when something breaks the HTTP/2 connection between the agent and the server. We already have an open issue to replace gRPC with something more reliable, so we do not need another GitHub issue to track this.
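The thread does not identify what breaks the connection. One common cause of resets on long-lived, mostly idle connections is an intermediary dropping state for a flow that has been silent for too long. As an illustration only, and assuming that is the failure mode here, a client can ask the kernel to send TCP keepalive probes on an idle socket. The helper name and the timing values below are hypothetical and are not taken from Drone.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle_s: int = 60,
                         interval_s: int = 10, probes: int = 3) -> None:
    """Turn on TCP keepalive probes for an otherwise idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform specific, so set only those that
    # this Python build exposes.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_tcp_keepalive(sock)
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print("keepalive enabled:", keepalive_enabled)
```

Keepalive probes only help when idleness is the trigger. They do nothing for resets caused by restarts, address changes, or proxy limits on connection lifetime.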

> OK, I'm opening the bug because I tried every other way to get some help.

I see you opened a thread in the Reddit forum just 3 hours ago. Please be more patient and allow adequate time for your questions to be answered on Reddit, especially considering your Reddit post was created at 4 a.m. Pacific time, when people are sleeping. If this is an urgent matter and you require 24/7 support and a 2-hour response SLA, please consider purchasing enterprise support.

harness locked and limited conversation to collaborators on Jun 5, 2018