数据清洗

主要是通过JobCleanService.java process()方法和完成JobCleanUtils代码来完成清洗

  • 书写process方法

          /**
      * 具体的实现。
      */
      public synchronized ServiceState process() {
          try {
              isTaskDone.set(false);
              // 清洗数据
              logger.info("清洗数据开始");
              //清洗所有互联网信息            
              jobDataReposity.cleanJobData();
              //进行岗位分类保存
              String table = hbaseclassify.getProperty("hbasetable");//分类
              String[] strArray = null;
              strArray = table.split(",");
              String JsonContext = ReadFile
                      .ReadFile(System.getProperty("user.dir") + "/configuration/hbaseclassify.json");
              JSONObject jsonObject;
              jsonObject = new JSONObject(JsonContext);
              //如何根据名称先分大类
              for (int i = 0; i < strArray.length; i++) {
                  JSONArray jsonArray = jsonObject.getJSONArray(strArray[i]);
                  //对保存的数据进行分类统计IT
                  //    [{"name":"云计算"},{"name":"cloud"},{"name":"openstack"},{"name":"kvm"},{"name":"vmware"},{"name":"ceph"},{"name":"sdn"},{"name":"云"},{"name":"阿里云"},{"name":"腾讯云"},{"name":"云存储"}]
                  jobDataReposity.classify(jsonArray, "job_" + strArray[i], strArray[i]);
              }
    
          } catch (Exception e) {
              logger.error(e.toString());
          }
          finally {
              isTaskDone.set(true);
          }
          logger.info("清洗数据结束");
          return ServiceState.STATE_RUNNING;
      }

    该方法是对具体的数据进行分类

  • 书写cleanJobData方法

      /** 
      *初中,高中,中技,中专,大专,本科,硕士,博士
      * @Title: ${cleanEducation} 
      * @Description: ${清洗工作学历}
      * @param ${scale:爬取的工作学历}  
      * @return ${清洗过的工作学历}
      * @throws 
      */
      public static String cleanEducation(String education) {//清洗学历 至初中,高中,中技,中专,大专,本科,硕士,博士,其他 其一
          if (education.contains("初中")) {
              return "初中";
          }
          else if(education.contains("高中")) {
              return "高中";
          }
          else if(education.contains("中专")) {
              return "中专";
          }
          else if(education.contains("中技")) {
              return "中技";
          }
          else if(education.contains("大专")) {
              return "大专";
          }
          else if(education.contains("本科")) {
              return "本科";
          }
          else if(education.contains("硕士")) {
              return "硕士";
          }
          else if(education.contains("博士")) {
              return "博士";
          }
          return "不限";
      }

    该方法是具体对学历的清洗操作

  • 启动服务

      @Override
      public ServiceState start() {
    
          //1. 启动收集服务(虫)
          //Service jobCollector = new JobCollectService(server, this);
          //jobCollector.start();
          // 2. 启动清洗服务(清洗)
          Service jobCleaner = new JobCleanService(server, this);
          jobCleaner.start();
          // 3. 启动分析服务(聚类)
          //Service jobCluster = new JobClusterService(server, this);
          //jobCluster.start();
          return ServiceState.STATE_RUNNING;
      }

    JobAnalysisService.java中只解除启动清洗服务的注解 然后运行EduInsightServer.java启动服务

Last updated