A Complete Analysis of the QEMU Source Code: CPU Virtualization (17)

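Continued from the previous article.

This article draws on the following references:

《趣谈Linux操作系统》 (Fun Talks on the Linux Operating System), Liu Chao, Geek Time (极客时间)

《QEMU/KVM源码解析与应用》 (QEMU/KVM Source Code Analysis and Application), Li Qiang, China Machine Press

《深度探索Linux系统虚拟化原理与实现》 (An In-Depth Exploration of Linux System Virtualization: Principles and Implementation), Wang Baisheng and Xie Guangjun, China Machine Press

With thanks to the authors.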

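III. Introduction to KVM Module Initialization

2. KVM Module Initialization

The previous installment began the walkthrough of kvm_init and covered the first function it calls, kvm_arch_init. This installment continues from there. For ease of reference, here is the source of kvm_init again. It lives in virt/kvm/kvm_main.c under the Linux kernel source root: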

    int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
                 struct module *module)
    {
        struct kvm_cpu_compat_check c;
        int r;
        int cpu;

        r = kvm_arch_init(opaque);
        if (r)
            goto out_fail;

        /*
         * kvm_arch_init makes sure there's at most one caller
         * for architectures that support multiple implementations,
         * like intel and amd on x86.
         * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
         * conflicts in case kvm is already setup for another implementation.
         */
        r = kvm_irqfd_init();
        if (r)
            goto out_irqfd;

        if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
            r = -ENOMEM;
            goto out_free_0;
        }

        r = kvm_arch_hardware_setup(opaque);
        if (r < 0)
            goto out_free_1;

        c.ret = &r;
        c.opaque = opaque;
        for_each_online_cpu(cpu) {
            smp_call_function_single(cpu, check_processor_compat, &c, 1);
            if (r < 0)
                goto out_free_2;
        }

        r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
                                      kvm_starting_cpu, kvm_dying_cpu);
        if (r)
            goto out_free_2;
        register_reboot_notifier(&kvm_reboot_notifier);

        /* A kmem cache lets us meet the alignment requirements of fx_save. */
        if (!vcpu_align)
            vcpu_align = __alignof__(struct kvm_vcpu);
        kvm_vcpu_cache =
            kmem_cache_create_usercopy("kvm_vcpu", vcpu_size, vcpu_align,
                                       SLAB_ACCOUNT,
                                       offsetof(struct kvm_vcpu, arch),
                                       offsetofend(struct kvm_vcpu, stats_id)
                                       - offsetof(struct kvm_vcpu, arch),
                                       NULL);
        if (!kvm_vcpu_cache) {
            r = -ENOMEM;
            goto out_free_3;
        }

        for_each_possible_cpu(cpu) {
            if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
                                        GFP_KERNEL, cpu_to_node(cpu))) {
                r = -ENOMEM;
                goto out_free_4;
            }
        }

        r = kvm_async_pf_init();
        if (r)
            goto out_free_4;

        kvm_chardev_ops.owner = module;

        r = misc_register(&kvm_dev);
        if (r) {
            pr_err("kvm: misc device register failed\n");
            goto out_unreg;
        }

        register_syscore_ops(&kvm_syscore_ops);

        kvm_preempt_ops.sched_in = kvm_sched_in;
        kvm_preempt_ops.sched_out = kvm_sched_out;

        kvm_init_debug();

        r = kvm_vfio_ops_init();
        WARN_ON(r);

        return 0;

    out_unreg:
        kvm_async_pf_deinit();
    out_free_4:
        for_each_possible_cpu(cpu)
            free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
        kmem_cache_destroy(kvm_vcpu_cache);
    out_free_3:
        unregister_reboot_notifier(&kvm_reboot_notifier);
        cpuhp_remove_state_nocalls(CPUHP_AP_KVM_STARTING);
    out_free_2:
        kvm_arch_hardware_unsetup();
    out_free_1:
        free_cpumask_var(cpus_hardware_enabled);
    out_free_0:
        kvm_irqfd_exit();
    out_irqfd:
        kvm_arch_exit();
    out_fail:
        return r;
    }
    EXPORT_SYMBOL_GPL(kvm_init);

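The functions that kvm_init calls, taken as a whole, are shown in the figure below:

[Figure: overview of the functions called by kvm_init]

(2) The kvm_irqfd_init function

The relevant fragment of kvm_init is: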

    /*
     * kvm_arch_init makes sure there's at most one caller
     * for architectures that support multiple implementations,
     * like intel and amd on x86.
     * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
     * conflicts in case kvm is already setup for another implementation.
     */
    r = kvm_irqfd_init();
    if (r)
        goto out_irqfd;

The declarations for this function are in include/linux/kvm_host.h under the Linux kernel source root:

    #ifdef CONFIG_HAVE_KVM_IRQFD
    int kvm_irqfd_init(void);
    void kvm_irqfd_exit(void);
    #else
    static inline int kvm_irqfd_init(void)
    {
        return 0;
    }

    static inline void kvm_irqfd_exit(void)
    {
    }
    #endif

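As the declaration shows, this function is configuration-dependent. It has a real body only when CONFIG_HAVE_KVM_IRQFD is set; otherwise it is an empty stub that simply returns 0.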
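The same compile-time stub pattern can be tried out in a small standalone file. The following is only a sketch: DEMO_HAVE_IRQFD, demo_irqfd_init, demo_irqfd_exit and demo_module_init are made-up names for illustration and are not kernel symbols.

```c
#include <assert.h>

/* Hypothetical switch standing in for CONFIG_HAVE_KVM_IRQFD.
 * Build with -DDEMO_HAVE_IRQFD to select the "real" implementation. */
#ifdef DEMO_HAVE_IRQFD
static int demo_irqfd_ready;

static int demo_irqfd_init(void)
{
    demo_irqfd_ready = 1;   /* stands in for allocating the workqueue */
    return 0;
}

static void demo_irqfd_exit(void)
{
    demo_irqfd_ready = 0;
}
#else
/* Feature compiled out: callers still compile unchanged, and the
 * compiler is free to drop the calls entirely. */
static inline int demo_irqfd_init(void)
{
    return 0;
}

static inline void demo_irqfd_exit(void)
{
}
#endif

/* The caller looks the same in both configurations, with no #ifdef. */
static int demo_module_init(void)
{
    int r = demo_irqfd_init();

    if (r)
        return r;
    demo_irqfd_exit();
    return 0;
}
```

The point of the pattern is that the #ifdef lives in one place, next to the declaration, so that call sites such as kvm_init stay free of conditional compilation.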

The version of kvm_irqfd_init used when the option is enabled is in virt/kvm/eventfd.c under the Linux kernel source root:

    /*
     * create a host-wide workqueue for issuing deferred shutdown requests
     * aggregated from all vm* instances. We need our own isolated
     * queue to ease flushing work items when a VM exits.
     */
    int kvm_irqfd_init(void)
    {
        irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", 0, 0);
        if (!irqfd_cleanup_wq)
            return -ENOMEM;

        return 0;
    }

    void kvm_irqfd_exit(void)
    {
        destroy_workqueue(irqfd_cleanup_wq);
    }

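At this point the irqfd mechanism of QEMU/KVM needs some explanation. The discussion follows this blog post:

https://www.cnblogs.com/haiyonghao/p/14440723.html

The irqfd mechanism in QEMU/KVM

The irqfd mechanism is similar to the ioeventfd mechanism, and both are built on eventfd. ioeventfd gives the guest a fast path for notifying QEMU/KVM (Guest -> QEMU/KVM); irqfd, correspondingly, gives QEMU/KVM a fast path for notifying the guest (QEMU/KVM -> Guest).

irqfd associates an eventfd with a global interrupt number. When the eventfd is signaled, the corresponding interrupt is injected into the virtual machine.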
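Since everything here rests on eventfd, it may help to see what "signaling an eventfd" looks like from user space. The sketch below assumes a Linux system with glibc's <sys/eventfd.h>; demo_eventfd_roundtrip is a hypothetical helper written for this illustration and involves no KVM at all.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal an eventfd twice, then drain it once.
 * Returns the value read back, or 0 on any error. */
static uint64_t demo_eventfd_roundtrip(uint64_t first, uint64_t second)
{
    uint64_t value = 0;
    int fd = eventfd(0, EFD_NONBLOCK);

    if (fd < 0)
        return 0;

    /* Each write adds its 8-byte value to the counter kept by the kernel. */
    if (write(fd, &first, sizeof(first)) != (ssize_t)sizeof(first) ||
        write(fd, &second, sizeof(second)) != (ssize_t)sizeof(second)) {
        close(fd);
        return 0;
    }

    /* Without EFD_SEMAPHORE, a read returns the accumulated count
     * and resets the counter to zero. */
    if (read(fd, &value, sizeof(value)) != (ssize_t)sizeof(value))
        value = 0;

    close(fd);
    return value;
}
```

With irqfd, the reader of the counter is not a user-space process: KVM hooks a callback onto the eventfd's wait queue, so the write itself is what kicks off the interrupt injection.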

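QEMU registers the irqfd

As with ioeventfd, an EventNotifier object must be initialized (with event_notifier_init) before an irqfd can be used. Once the EventNotifier has been initialized, an eventfd is available.

Sending the irqfd registration request to KVM

With the eventfd in hand, QEMU goes through kvm_irqchip_add_irqfd_notifier_gsi => kvm_irqchip_assign_irqfd, which builds a kvm_irqfd structure and sends ioctl(KVM_IRQFD) to KVM.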

    static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
                                        bool assign)
    {
        struct kvm_irqfd irqfd = {
            .fd = fd,
            .gsi = virq,
            .flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
        };

        if (rfd != -1) {
            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
            irqfd.resamplefd = rfd;
        }

        if (!kvm_irqfds_enabled()) {
            return -ENOSYS;
        }

        return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
    }

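kvm_irqchip_assign_irqfd first builds a variable irqfd of type struct kvm_irqfd. Its fd field is the eventfd initialized earlier, gsi is the global system interrupt number, and flags says whether the request registers the irqfd with KVM (flags = 0) or deregisters it (KVM_IRQFD_FLAG_DEASSIGN, that is, bit 0 set, flags = 1). Bit 1 of flags (KVM_IRQFD_FLAG_RESAMPLE) indicates whether the interrupt is level-triggered.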
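To make the flag values concrete, here is a minimal sketch of how the request is composed for the four combinations of assign and rfd. The DEMO_ constants are local stand-ins assumed to mirror bit 0 and bit 1 as described above, and struct demo_irqfd_req and demo_build_req are hypothetical names; real code would use struct kvm_irqfd and the constants from <linux/kvm.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for KVM_IRQFD_FLAG_DEASSIGN and KVM_IRQFD_FLAG_RESAMPLE. */
#define DEMO_IRQFD_FLAG_DEASSIGN (1u << 0)
#define DEMO_IRQFD_FLAG_RESAMPLE (1u << 1)

/* Simplified request layout, for illustration only. */
struct demo_irqfd_req {
    uint32_t fd;
    uint32_t gsi;
    uint32_t flags;
    uint32_t resamplefd;
};

/* Same decision logic as kvm_irqchip_assign_irqfd, minus the ioctl. */
static struct demo_irqfd_req demo_build_req(int fd, int rfd, int virq, int assign)
{
    struct demo_irqfd_req req = {
        .fd = (uint32_t)fd,
        .gsi = (uint32_t)virq,
        .flags = assign ? 0 : DEMO_IRQFD_FLAG_DEASSIGN,
    };

    /* A resample fd other than -1 marks the interrupt as level-triggered. */
    if (rfd != -1) {
        req.flags |= DEMO_IRQFD_FLAG_RESAMPLE;
        req.resamplefd = (uint32_t)rfd;
    }
    return req;
}
```

For example, demo_build_req(3, -1, 5, 1) describes registering an edge-triggered irqfd on GSI 5 (flags 0), while demo_build_req(3, 4, 5, 1) describes a level-triggered one with resample fd 4 (flags 2).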

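  • About KVM_IRQFD_FLAG_RESAMPLE

When the interrupt is edge-triggered, irqfd->fd is connected to the gsi pin of the interrupt controller (irqchip) inside KVM. irqfd->fd is also responsible for toggling the interrupt and for triggering the handler in user space.

When the interrupt is level-triggered, irqfd->fd is likewise connected to the gsi pin of the irqchip in KVM. When the irqchip receives an EOI (end of interrupt), which serves as the resample signal, the level of the gsi is flipped back, and user space is notified through irqfd->resamplefd (resamplefd is also an eventfd).

Finally, kvm_irqchip_assign_irqfd calls kvm_vm_ioctl(s, KVM_IRQFD, &irqfd) to ask KVM to register an irqfd described by the kvm_irqfd structure built above.

KVM registers the irqfd

On receiving ioctl(KVM_IRQFD), KVM first copies in the kvm_irqfd structure passed by user space and then calls kvm_irqfd.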

    case KVM_IRQFD: {
        struct kvm_irqfd data;

        r = -EFAULT;
        if (copy_from_user(&data, argp, sizeof(data)))
            goto out;
        r = kvm_irqfd(kvm, &data);
        break;
    }

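kvm_irqfd rejects any unknown flag bits and then looks at bit 0 of flags in the kvm_irqfd structure to decide whether the request registers or deregisters an irqfd. Registration calls kvm_irqfd_assign; deregistration calls kvm_irqfd_deassign.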

    int
    kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
    {
        if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_RESAMPLE))
            return -EINVAL;

        if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
            return kvm_irqfd_deassign(kvm, args);

        return kvm_irqfd_assign(kvm, args);
    }
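The validate-then-dispatch shape of kvm_irqfd can be exercised in isolation. The following sketch uses local stand-in constants and a hypothetical demo_irqfd_dispatch function that only reports which path would be taken; it does not call into KVM.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for the two flag bits accepted by kvm_irqfd. */
#define DEMO_IRQFD_FLAG_DEASSIGN (1u << 0)
#define DEMO_IRQFD_FLAG_RESAMPLE (1u << 1)

enum demo_path {
    DEMO_PATH_ASSIGN = 0,
    DEMO_PATH_DEASSIGN = 1,
};

/* Returns -EINVAL for unknown bits, otherwise the path that would run. */
static int demo_irqfd_dispatch(uint32_t flags)
{
    /* Masking out the known bits leaves only bits nobody defined. */
    if (flags & ~(DEMO_IRQFD_FLAG_DEASSIGN | DEMO_IRQFD_FLAG_RESAMPLE))
        return -EINVAL;

    if (flags & DEMO_IRQFD_FLAG_DEASSIGN)
        return DEMO_PATH_DEASSIGN;

    return DEMO_PATH_ASSIGN;
}
```

Rejecting undefined bits up front keeps them available for future extensions: a caller that sets a bit the kernel does not know about gets an error instead of being silently ignored.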

kvm_irqfd_assign is in the same file (virt/kvm/eventfd.c under the Linux kernel source root):

    static int
    kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
    {
        struct kvm_kernel_irqfd *irqfd, *tmp;
        struct fd f;
        struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
        int ret;
        __poll_t events;
        int idx;

        if (!kvm_arch_intc_initialized(kvm))
            return -EAGAIN;

        if (!kvm_arch_irqfd_allowed(kvm, args))
            return -EINVAL;

        irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL_ACCOUNT);
        if (!irqfd)
            return -ENOMEM;

        irqfd->kvm = kvm;
        irqfd->gsi = args->gsi;
        INIT_LIST_HEAD(&irqfd->list);
        INIT_WORK(&irqfd->inject, irqfd_inject);
        INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
        seqcount_spinlock_init(&irqfd->irq_entry_sc, &kvm->irqfds.lock);

        f = fdget(args->fd);
        if (!f.file) {
            ret = -EBADF;
            goto out;
        }

        eventfd = eventfd_ctx_fileget(f.file);
        if (IS_ERR(eventfd)) {
            ret = PTR_ERR(eventfd);
            goto fail;
        }

        irqfd->eventfd = eventfd;

        if (args->flags & KVM_IRQFD_FLAG_RESAMPLE) {
            struct kvm_kernel_irqfd_resampler *resampler;

            resamplefd = eventfd_ctx_fdget(args->resamplefd);
            if (IS_ERR(resamplefd)) {
                ret = PTR_ERR(resamplefd);
                goto fail;
            }

            irqfd->resamplefd = resamplefd;
            INIT_LIST_HEAD(&irqfd->resampler_link);

            mutex_lock(&kvm->irqfds.resampler_lock);

            list_for_each_entry(resampler,
                                &kvm->irqfds.resampler_list, link) {
                if (resampler->notifier.gsi == irqfd->gsi) {
                    irqfd->resampler = resampler;
                    break;
                }
            }

            if (!irqfd->resampler) {
                resampler = kzalloc(sizeof(*resampler),
                                    GFP_KERNEL_ACCOUNT);
                if (!resampler) {
                    ret = -ENOMEM;
                    mutex_unlock(&kvm->irqfds.resampler_lock);
                    goto fail;
                }

                resampler->kvm = kvm;
                INIT_LIST_HEAD(&resampler->list);
                resampler->notifier.gsi = irqfd->gsi;
                resampler->notifier.irq_acked = irqfd_resampler_ack;
                INIT_LIST_HEAD(&resampler->link);

                list_add(&resampler->link, &kvm->irqfds.resampler_list);
                kvm_register_irq_ack_notifier(kvm,
                                              &resampler->notifier);
                irqfd->resampler = resampler;
            }

            list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
            synchronize_srcu(&kvm->irq_srcu);

            mutex_unlock(&kvm->irqfds.resampler_lock);
        }

        /*
         * Install our own custom wake-up handling so we are notified via
         * a callback whenever someone signals the underlying eventfd
         */
        init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
        init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);

        spin_lock_irq(&kvm->irqfds.lock);

        ret = 0;
        list_for_each_entry(tmp, &kvm->irqfds.items, list) {
            if (irqfd->eventfd != tmp->eventfd)
                continue;
            /* This fd is used for another irq already. */
            ret = -EBUSY;
            spin_unlock_irq(&kvm->irqfds.lock);
            goto fail;
        }

        idx = srcu_read_lock(&kvm->irq_srcu);
        irqfd_update(kvm, irqfd);

        list_add_tail(&irqfd->list, &kvm->irqfds.items);

        spin_unlock_irq(&kvm->irqfds.lock);

        /*
         * Check if there was an event already pending on the eventfd
         * before we registered, and trigger it as if we didn't miss it.
         */
        events = vfs_poll(f.file, &irqfd->pt);

        if (events & EPOLLIN)
            schedule_work(&irqfd->inject);

    #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
        if (kvm_arch_has_irq_bypass()) {
            irqfd->consumer.token = (void *)irqfd->eventfd;
            irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer;
            irqfd->consumer.del_producer = kvm_arch_irq_bypass_del_producer;
            irqfd->consumer.stop = kvm_arch_irq_bypass_stop;
            irqfd->consumer.start = kvm_arch_irq_bypass_start;
            ret = irq_bypass_register_consumer(&irqfd->consumer);
            if (ret)
                pr_info("irq bypass consumer (token %p) registration fails: %d\n",
                        irqfd->consumer.token, ret);
        }
    #endif

        srcu_read_unlock(&kvm->irq_srcu, idx);

        /*
         * do not drop the file until the irqfd is fully initialized, otherwise
         * we might race against the EPOLLHUP
         */
        fdput(f);
        return 0;

    fail:
        if (irqfd->resampler)
            irqfd_resampler_shutdown(irqfd);

        if (resamplefd && !IS_ERR(resamplefd))
            eventfd_ctx_put(resamplefd);

        if (eventfd && !IS_ERR(eventfd))
            eventfd_ctx_put(eventfd);

        fdput(f);

    out:
        kfree(irqfd);
        return ret;
    }

kvm_irqfd_assign first declares a variable irqfd of type struct kvm_kernel_irqfd and allocates memory for it, then fills in its fields. The code fragment is:

    irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL_ACCOUNT);
    if (!irqfd)
        return -ENOMEM;

    irqfd->kvm = kvm;
    irqfd->gsi = args->gsi;
    INIT_LIST_HEAD(&irqfd->list);
    INIT_WORK(&irqfd->inject, irqfd_inject);
    INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
    seqcount_spinlock_init(&irqfd->irq_entry_sc, &kvm->irqfds.lock);

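The kvm_kernel_irqfd structure contains two work_struct members, inject and shutdown. The first is responsible for triggering the interrupt and the second for shutting the irqfd down; their handler functions are irqfd_inject and irqfd_shutdown respectively.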
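Embedding work items in a larger structure relies on the handler being able to get back from the work item to its owner, which the kernel does with container_of. The sketch below reproduces that idea in user space; struct demo_work, struct demo_irqfd and the demo_ functions are simplified stand-ins invented for this example, not the kernel's work_struct API.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel's container_of macro. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Minimal stand-in for work_struct: just a handler pointer. */
struct demo_work {
    void (*func)(struct demo_work *work);
};

/* Stand-in for kvm_kernel_irqfd with two embedded work items. */
struct demo_irqfd {
    int gsi;
    int injected;    /* number of times the inject handler ran */
    int shut_down;   /* set once the shutdown handler ran */
    struct demo_work inject;
    struct demo_work shutdown;
};

static void demo_inject(struct demo_work *work)
{
    /* Recover the owning irqfd from the embedded member's address. */
    struct demo_irqfd *irqfd =
        demo_container_of(work, struct demo_irqfd, inject);

    irqfd->injected++;
}

static void demo_shutdown(struct demo_work *work)
{
    struct demo_irqfd *irqfd =
        demo_container_of(work, struct demo_irqfd, shutdown);

    irqfd->shut_down = 1;
}

/* Plays the role of the two INIT_WORK calls. */
static void demo_irqfd_setup(struct demo_irqfd *irqfd, int gsi)
{
    irqfd->gsi = gsi;
    irqfd->injected = 0;
    irqfd->shut_down = 0;
    irqfd->inject.func = demo_inject;
    irqfd->shutdown.func = demo_shutdown;
}

/* Plays the role of a workqueue running one item (synchronously here). */
static void demo_run_work(struct demo_work *work)
{
    work->func(work);
}
```

A caller would set up the structure with demo_irqfd_setup(&irqfd, 5) and then pass &irqfd.inject or &irqfd.shutdown to demo_run_work. In the kernel the work runs later on a workqueue rather than synchronously, which is the part this sketch leaves out.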

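kvm_irqfd_assign calls init_waitqueue_func_entry to register irqfd_wakeup as the function that runs when the wait queue entry inside the irqfd is woken. As a result, any write to the eventfd behind this irqfd ends up triggering irqfd_wakeup.

kvm_irqfd_assign then uses init_poll_funcptr to register irqfd_ptable_queue_proc as the handler of the irqfd's poll table. irqfd_ptable_queue_proc adds the wait queue entry associated with the poll table to the wait queue. The code fragment is: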

    /*
     * Install our own custom wake-up handling so we are notified via
     * a callback whenever someone signals the underlying eventfd
     */
    init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
    init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);

kvm_irqfd_assign next checks whether the eventfd is already being used by another interrupt. The code fragment is:

    ret = 0;
    list_for_each_entry(tmp, &kvm->irqfds.items, list) {
        if (irqfd->eventfd != tmp->eventfd)
            continue;
        /* This fd is used for another irq already. */
        ret = -EBUSY;
        spin_unlock_irq(&kvm->irqfds.lock);
        goto fail;
    }

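With irqfd->pt as the argument, kvm_irqfd_assign calls the eventfd's poll function, eventfd_poll (reached through vfs_poll in the code above). eventfd_poll calls poll_wait, which in turn invokes the function registered for the poll table earlier, irqfd_ptable_queue_proc. irqfd_ptable_queue_proc adds irqfd->wait to the eventfd's wqh wait queue. From then on, whenever another process or the kernel writes to the eventfd, the callbacks of the entries on the wqh wait queue are executed, which in this case means irqfd_wakeup.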
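The poll, queue and wake-up sequence described above can be modeled in a few lines of plain C. This is a heavily simplified sketch: the demo_ structures and functions are invented for this illustration, there is no locking, and the real wait queue and poll table types carry considerably more state.

```c
#include <assert.h>
#include <stddef.h>

#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for a wait queue entry with a custom wake-up function. */
struct demo_wait_entry {
    void (*func)(struct demo_wait_entry *entry);
    struct demo_wait_entry *next;
};

/* Stand-in for an eventfd: a counter plus its wait queue head (wqh). */
struct demo_eventfd {
    unsigned long count;
    struct demo_wait_entry *waiters;
};

/* Stand-in for kvm_kernel_irqfd: owns the wait entry. */
struct demo_irqfd {
    struct demo_wait_entry wait;
    int wakeups;
};

/* Plays the role of irqfd_wakeup. */
static void demo_irqfd_wakeup(struct demo_wait_entry *entry)
{
    struct demo_irqfd *irqfd =
        demo_container_of(entry, struct demo_irqfd, wait);

    irqfd->wakeups++;   /* the real code would inject the interrupt here */
}

/* Plays the role of irqfd_ptable_queue_proc: hook the entry onto wqh. */
static void demo_queue_proc(struct demo_eventfd *efd,
                            struct demo_wait_entry *entry)
{
    entry->next = efd->waiters;
    efd->waiters = entry;
}

/* Plays the role of eventfd_poll: queue the caller, report readiness. */
static int demo_poll(struct demo_eventfd *efd, struct demo_wait_entry *entry)
{
    demo_queue_proc(efd, entry);
    return efd->count > 0;
}

/* Plays the role of a write to the eventfd: bump the counter and
 * run every callback queued on the wait queue. */
static void demo_signal(struct demo_eventfd *efd, unsigned long n)
{
    struct demo_wait_entry *e;

    efd->count += n;
    for (e = efd->waiters; e; e = e->next)
        e->func(e);
}
```

The design choice worth noticing is that no thread sleeps on the queue. The "waiter" is just a callback, so the signal path runs the injection logic directly in the context of whoever wrote to the eventfd.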

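Only the case where data is available is discussed here, that is, EPOLLIN is set in flags. In that case irqfd_wakeup calls kvm_arch_set_irq_inatomic to inject the interrupt: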

kvm_arch_set_irq_inatomic =>

kvm_set_msi_irq =>

kvm_irq_delivery_to_apic_fast

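If kvm_arch_set_irq_inatomic cannot inject the interrupt (that is, the interrupt is neither an MSI nor an HV_SINT interrupt), the irqfd->inject work item is scheduled, which means irqfd_inject gets called.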

    static void irqfd_inject(struct work_struct *work)
    {
        struct kvm_kernel_irqfd *irqfd =
            container_of(work, struct kvm_kernel_irqfd, inject);
        struct kvm *kvm = irqfd->kvm;

        if (!irqfd->resampler) {
            kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1,
                        false);
            kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0,
                        false);
        } else
            kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
                        irqfd->gsi, 1, false);
    }

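In irqfd_inject, if the interrupt configured for this irqfd is edge-triggered, kvm_set_irq is called twice to form an interrupt pulse, so that the interrupt controller (irqchip) in KVM can detect the interrupt. If the interrupt is level-triggered, kvm_set_irq is called once to pull the line high so that the irqchip sees it; pulling the level-triggered line low again is done later, triggered by the irqchip's EOI.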
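The difference between the two branches can be illustrated with a toy model of a single interrupt line. Everything in the sketch below (struct demo_irqchip, demo_set_irq, demo_inject, demo_eoi) is a hypothetical simplification written for this example and does not correspond to KVM's irqchip code.

```c
#include <assert.h>

/* Toy model of one GSI pin on an interrupt controller. */
struct demo_irqchip {
    int level;          /* current state of the pin */
    int rising_edges;   /* how many 0 -> 1 transitions were observed */
};

/* Plays the role of kvm_set_irq for a single line. */
static void demo_set_irq(struct demo_irqchip *chip, int level)
{
    if (level && !chip->level)
        chip->rising_edges++;
    chip->level = level;
}

/* Mirrors the branch structure of irqfd_inject. */
static void demo_inject(struct demo_irqchip *chip, int has_resampler)
{
    if (!has_resampler) {
        /* Edge-triggered: raise and lower immediately, forming a pulse. */
        demo_set_irq(chip, 1);
        demo_set_irq(chip, 0);
    } else {
        /* Level-triggered: raise only; the line stays high until EOI. */
        demo_set_irq(chip, 1);
    }
}

/* Plays the role of the EOI path that lowers a level-triggered line. */
static void demo_eoi(struct demo_irqchip *chip)
{
    demo_set_irq(chip, 0);
}
```

In this model, a second demo_inject on a level-triggered line that is still high produces no new rising edge, which is why the resample notification after EOI matters: it tells user space the line was lowered, so the device can raise it again if it still needs service.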

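Summary

irqfd is built on the eventfd mechanism. QEMU binds a gsi (global system interrupt number) to an eventfd and sends KVM a request to register the irqfd. On receiving the request, KVM adds the eventfd, together with its gsi information, to the wait queue associated with the irqfd. As soon as a process writes to that eventfd, the entry on the wait queue is woken and the wake-up function (irqfd_wakeup) is called to inject an interrupt into the guest. The details of the injection step itself depend on the specific interrupt controller, such as the PIC or APIC.

This completes the analysis of kvm_irqfd_init, the second function called by kvm_init. The next installment continues with the rest of kvm_init.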
